Bug 5948 - Oops when accessing SATA with device mapper on AMD 64 X2
Summary: Oops when accessing SATA with device mapper on AMD 64 X2
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: i386 Linux
Importance: P2 high
Assignee: Alasdair G Kergon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-01-24 06:46 UTC by Brian Daniels
Modified: 2007-11-13 04:35 UTC
CC List: 6 users

See Also:
Kernel Version: 2.6.15.1smp
Subsystem:
Regression: ---
Bisected commit-id:


Description Brian Daniels 2006-01-24 06:46:31 UTC
Most recent kernel where this bug did not occur: unknown
Distribution: Fedora Core 3
Hardware Environment: MSI K8N Neo2 Platinum, AMD 64 X2 4200+, nForce3 chipset
Software Environment: Occurs with 2.6.12-1.1381_FC3smp, 2.6.15, 2.6.15.1
Problem Description:
My system has its two primary drives running from the SATA 1&2 connections on
the MB in a RAID 1 array.  I have another HD attached to IDE2 on the MB that I
use to run backups to.

This configuration has been running fine for over a year.  I recently upgraded
the CPU to an AMD64 X2 4200+ and switched to a SMP kernel (2.6.12-1.1381_FC3smp).

My backup script uses tar to copy files from the main filesystem to hdc, which
is mounted as a cryptographic filesystem.  It takes about 3-4 hours to complete.
Since the upgrade, the backup will fail at some point during the process with
the Oops below.  The time to failure seems random: sometimes in the first 15
minutes, sometimes after 2 hours.

This initially occurred with the stock 2.6.12-1.1381_FC3smp kernel.  I have since
tried compiling 2.6.15 and 2.6.15.1, which both fail in the same way.  Running
with a non-smp kernel, I have no problems.

The Oops is sometimes on CPU 0, sometimes on CPU 1.  The system is completely
stable running on the SATA drives alone, even under high load.  After the Oops
the system is still functional, but you cannot unmount hdc, and shutdown will
not work.

Jan 23 23:26:51 hotblue kernel: Unable to handle kernel NULL pointer dereference
at 0000000000000007 RIP:
Jan 23 23:26:51 hotblue kernel: <ffffffff80220175>{blk_recount_segments+133}
Jan 23 23:26:51 hotblue kernel: PGD 1e0bf067 PUD 1e0c3067 PMD 0
Jan 23 23:26:51 hotblue kernel: Oops: 0000 [1] SMP
Jan 23 23:26:51 hotblue kernel: CPU 1
Jan 23 23:26:51 hotblue kernel: Modules linked in: sha256 aes_x86_64 dm_crypt
nvidia parport_pc lp parport w83627hf hwmon_vid eeprom i2c_isa ipt_REJECT
ipt_state ip_conntrack iptable_filter ip_tables nls_utf8 ntfs vfat fat video
button battery ac powermate ohci1394 ieee1394 ohci_hcd ehci_hcd i2c_nforce2
i2c_core snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm_oss snd_mixer_oss
snd_pcm snd_timer snd soundcore snd_page_alloc r8169 forcedeth floppy dm_mod
sata_nv libata sd_mod scsi_mod
Jan 23 23:26:51 hotblue kernel: Pid: 4704, comm: tar Tainted: P      2.6.15.1smp #1
Jan 23 23:26:51 hotblue kernel: RIP: 0010:[<ffffffff80220175>]
<ffffffff80220175>{blk_recount_segments+133}
Jan 23 23:26:51 hotblue kernel: RSP: 0018:ffff81001e0d54c8  EFLAGS: 00010202
Jan 23 23:26:51 hotblue kernel: RAX: 0000000000000000 RBX: 8e38e38e38e38e39 RCX:
000000000000000d
Jan 23 23:26:51 hotblue kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff81003f8146b8
Jan 23 23:26:51 hotblue kernel: RBP: 0000000000000000 R08: 0000000000000000 R09:
0000000000000000
Jan 23 23:26:51 hotblue kernel: R10: ffff81001263a000 R11: 0000000000000000 R12:
ffff81002abcef40
Jan 23 23:26:51 hotblue kernel: R13: 0000000000020000 R14: 00000000000000d0 R15:
0000000000000000
Jan 23 23:26:51 hotblue kernel: FS:  00002aaaaaad5b00(0000)
GS:ffffffff8053c880(0000) knlGS:0000000055921060
Jan 23 23:26:51 hotblue kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 23 23:26:51 hotblue kernel: CR2: 0000000000000007 CR3: 000000001eeba000 CR4:
00000000000006e0
Jan 23 23:26:51 hotblue kernel: Process tar (pid: 4704, threadinfo
ffff81001e0d4000, task ffff81003e932040)
Jan 23 23:26:51 hotblue kernel: Stack: 000d000000000048 000000013ffb7fc0
0000000100010010 0000000000000000
Jan 23 23:26:51 hotblue kernel:        ffff81003f8146b8 ffff81002abcef40
ffff8100225ba6c0 ffff81003f8146b8
Jan 23 23:26:51 hotblue kernel:        0000000000020000 00000000000000d0
Jan 23 23:26:52 hotblue kernel: Call Trace:<ffffffff80189c86>{bio_clone+166}
<ffffffff886cc06e>{:dm_crypt:crypt_map+238}
Jan 23 23:26:52 hotblue kernel:        <ffffffff880472a1>{:dm_mod:__map_bio+81}
<ffffffff88047629>{:dm_mod:__split_bio+361}
Jan 23 23:26:52 hotblue kernel:        <ffffffff80379541>{__down_read+49}
<ffffffff88047935>{:dm_mod:dm_request+261}
Jan 23 23:26:52 hotblue kernel:       
<ffffffff80222829>{generic_make_request+361}
<ffffffff8018997e>{bio_alloc_bioset+382}
Jan 23 23:26:52 hotblue kernel:        <ffffffff80222918>{submit_bio+216}
<ffffffff801abb52>{mpage_bio_submit+34}
Jan 23 23:26:52 hotblue kernel:        <ffffffff801ac49c>{__mpage_writepage+860}
<ffffffff801eceb0>{ext2_get_block+0}
Jan 23 23:26:52 hotblue kernel:       
<ffffffff80186465>{__find_get_block_slow+277} <ffffffff80379c09>{_read_unlock_irq+9}
Jan 23 23:26:52 hotblue kernel:       
<ffffffff8015e9ae>{find_get_pages_tag+158} <ffffffff801ac8c3>{mpage_writepages+579}
Jan 23 23:26:52 hotblue kernel:        <ffffffff801ed4d0>{ext2_writepage+0}
<ffffffff801eceb0>{ext2_get_block+0}
Jan 23 23:26:53 hotblue kernel:       
<ffffffff801aaf5b>{__writeback_single_inode+571}
<ffffffff801ecfb5>{ext2_get_block+261}
Jan 23 23:26:53 hotblue kernel:        <ffffffff801ab674>{sync_sb_inodes+500}
<ffffffff801ab9d6>{writeback_inodes+166}
Jan 23 23:26:53 hotblue kernel:       
<ffffffff80164ae4>{balance_dirty_pages_ratelimited+228}
Jan 23 23:26:53 hotblue kernel:       
<ffffffff801602b5>{generic_file_buffered_write+1285}
Jan 23 23:26:53 hotblue kernel:        <ffffffff8013a1e5>{current_fs_time+85}
<ffffffff801a049e>{inode_update_time+62}
Jan 23 23:26:53 hotblue kernel:       
<ffffffff801609aa>{__generic_file_aio_write_nolock+938}
Jan 23 23:26:53 hotblue kernel:        <ffffffff8012d8ab>{try_to_wake_up+1243}
<ffffffff80160a75>{__generic_file_write_nolock+149}
Jan 23 23:26:53 hotblue kernel:        <ffffffff8012f323>{__wake_up+67}
<ffffffff80191c68>{pipe_readv+840}
Jan 23 23:26:53 hotblue kernel:       
<ffffffff8014b9b0>{autoremove_wake_function+0}
<ffffffff80160c0e>{generic_file_write+126}
Jan 23 23:26:53 hotblue kernel:        <ffffffff801847f9>{vfs_write+233}
<ffffffff80184993>{sys_write+83}
Jan 23 23:26:53 hotblue kernel:        <ffffffff8010dc56>{system_call+126}
Jan 23 23:26:53 hotblue kernel:
Jan 23 23:26:53 hotblue kernel: Code: 49 0f b6 41 07 4c 89 c9 48 8b 04 c5 40 5e
54 80 48 2b 88 20
Jan 23 23:26:53 hotblue kernel: RIP <ffffffff80220175>{blk_recount_segments+133}
RSP <ffff81001e0d54c8>
Jan 23 23:26:53 hotblue kernel: CR2: 0000000000000007


Steps to reproduce:
Start a process that heavily accesses both the SATA drives and the IDE drive.
Comment 1 Martin J. Bligh 2006-01-24 07:28:25 UTC
Please read the bug filing instructions. No binary modules.
Comment 2 Brian Daniels 2006-01-24 08:42:50 UTC
Sigh.  Restarted system at runlevel 3, no nvidia modules.  Re-ran backup, 30 min
later:


Jan 24 11:21:43 hotblue kernel: Unable to handle kernel NULL pointer dereference
at 0000000000000007 RIP:
Jan 24 11:21:43 hotblue kernel: <ffffffff80220175>{blk_recount_segments+133}
Jan 24 11:21:43 hotblue kernel: PGD 3d007067 PUD 33891067 PMD 0
Jan 24 11:21:43 hotblue kernel: Oops: 0000 [1] SMP
Jan 24 11:21:43 hotblue kernel: CPU 0
Jan 24 11:21:43 hotblue kernel: Modules linked in: sha256 aes_x86_64 dm_crypt
parport_pc lp parport ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables
nls_utf8 ntfs vfat fat video button battery ac powermate ohci1394 ieee1394
ohci_hcd ehci_hcd i2c_nforce2 i2c_core snd_intel8x0 snd_ac97_codec snd_ac97_bus
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc r8169
forcedeth floppy dm_mod sata_nv libata sd_mod scsi_mod
Jan 24 11:21:43 hotblue kernel: Pid: 4246, comm: tar Not tainted 2.6.15.1smp #1
Jan 24 11:21:43 hotblue kernel: RIP: 0010:[<ffffffff80220175>]
<ffffffff80220175>{blk_recount_segments+133}
Jan 24 11:21:43 hotblue kernel: RSP: 0018:ffff810033f7f4c8  EFLAGS: 00010202
Jan 24 11:21:43 hotblue kernel: RAX: 0000000000000000 RBX: 8e38e38e38e38e39 RCX:
000000000000000d
Jan 24 11:21:43 hotblue kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff81003f867108
Jan 24 11:21:43 hotblue kernel: RBP: 0000000000000000 R08: 0000000000000000 R09:
0000000000000000
Jan 24 11:21:43 hotblue kernel: R10: ffff810016ee4000 R11: 0000000000000000 R12:
ffff81001d24f5c0
Jan 24 11:21:43 hotblue kernel: R13: 0000000000020000 R14: 00000000000001e0 R15:
0000000000000000
Jan 24 11:21:44 hotblue kernel: FS:  00002aaaaaad5b00(0000)
GS:ffffffff8053c800(0000) knlGS:0000000000000000
Jan 24 11:21:44 hotblue kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 24 11:21:44 hotblue kernel: CR2: 0000000000000007 CR3: 000000003392b000 CR4:
00000000000006e0
Jan 24 11:21:44 hotblue kernel: Process tar (pid: 4246, threadinfo
ffff810033f7e000, task ffff81003e436760)
Jan 24 11:21:44 hotblue kernel: Stack: 000d000000000048 000000013ffb7fc0
0000000100010010 0000000000000000
Jan 24 11:21:44 hotblue kernel:        ffff81003f867108 ffff81001d24f5c0
ffff81000a844540 ffff81003f867108
Jan 24 11:21:44 hotblue kernel:        0000000000020000 00000000000001e0
Jan 24 11:21:44 hotblue kernel: Call Trace:<ffffffff80189c86>{bio_clone+166}
<ffffffff8821306e>{:dm_crypt:crypt_map+238}
Jan 24 11:21:44 hotblue kernel:        <ffffffff880472a1>{:dm_mod:__map_bio+81}
<ffffffff88047629>{:dm_mod:__split_bio+361}
Jan 24 11:21:44 hotblue kernel:        <ffffffff80379541>{__down_read+49}
<ffffffff88047935>{:dm_mod:dm_request+261}
Jan 24 11:21:44 hotblue kernel:       
<ffffffff80222829>{generic_make_request+361}
<ffffffff8018997e>{bio_alloc_bioset+382}
Jan 24 11:21:44 hotblue kernel:        <ffffffff80222918>{submit_bio+216}
<ffffffff801abb52>{mpage_bio_submit+34}
Jan 24 11:21:44 hotblue kernel:        <ffffffff801ac49c>{__mpage_writepage+860}
<ffffffff801eceb0>{ext2_get_block+0}
Jan 24 11:21:44 hotblue kernel:        <ffffffff80379c09>{_read_unlock_irq+9}
<ffffffff8015e9ae>{find_get_pages_tag+158}
Jan 24 11:21:44 hotblue kernel:        <ffffffff801ac8c3>{mpage_writepages+579}
<ffffffff801ed4d0>{ext2_writepage+0}
Jan 24 11:21:44 hotblue kernel:        <ffffffff801eceb0>{ext2_get_block+0}
<ffffffff801aaf5b>{__writeback_single_inode+571}
Jan 24 11:21:44 hotblue kernel:        <ffffffff801ecfb5>{ext2_get_block+261}
<ffffffff801ab674>{sync_sb_inodes+500}
Jan 24 11:21:44 hotblue kernel:        <ffffffff801ab9d6>{writeback_inodes+166}
<ffffffff80164ae4>{balance_dirty_pages_ratelimited+228}
Jan 24 11:21:44 hotblue kernel:       
<ffffffff801602b5>{generic_file_buffered_write+1285}
Jan 24 11:21:44 hotblue kernel:        <ffffffff8013a1e5>{current_fs_time+85}
<ffffffff801a049e>{inode_update_time+62}
Jan 24 11:21:44 hotblue kernel:       
<ffffffff801609aa>{__generic_file_aio_write_nolock+938}
Jan 24 11:21:44 hotblue kernel:        <ffffffff8012d8ab>{try_to_wake_up+1243}
<ffffffff80160a75>{__generic_file_write_nolock+149}
Jan 24 11:21:44 hotblue kernel:        <ffffffff8012f323>{__wake_up+67}
<ffffffff80191c68>{pipe_readv+840}
Jan 24 11:21:44 hotblue kernel:       
<ffffffff8014b9b0>{autoremove_wake_function+0}
<ffffffff80160c0e>{generic_file_write+126}
Jan 24 11:21:44 hotblue kernel:        <ffffffff801847f9>{vfs_write+233}
<ffffffff80184993>{sys_write+83}
Jan 24 11:21:44 hotblue kernel:        <ffffffff8010dc56>{system_call+126}
Jan 24 11:21:44 hotblue kernel:
Jan 24 11:21:44 hotblue kernel: Code: 49 0f b6 41 07 4c 89 c9 48 8b 04 c5 40 5e
54 80 48 2b 88 20
Jan 24 11:21:44 hotblue kernel: RIP <ffffffff80220175>{blk_recount_segments+133}
RSP <ffff810033f7f4c8>
Jan 24 11:21:44 hotblue kernel: CR2: 0000000000000007
Comment 3 Bartlomiej Zolnierkiewicz 2006-01-24 09:52:45 UTC
looks like device-mapper or block layer problem...
Comment 4 Andrew Morton 2006-01-24 10:42:52 UTC
It's possibly a device-mapper bug.  Are you able to reproduce
it on a system which has never had the nvidia module loaded?
Comment 5 Brian Daniels 2006-01-24 10:49:01 UTC
I only have the one system.  If you mean can I repro it on a clean boot never
loading that module, yes.  If you mean literally never having had the nvidia
code on the system since install, it's too late for me.  

Is there something I can clear or remove to test the devicemapper issue?  
Comment 6 Alasdair G Kergon 2006-01-25 07:26:07 UTC
Please can you tell us your device-mapper configuration by running:

dmsetup info -c
dmsetup table
dmsetup status

NB Your password may appear coded into a long hex string in the output on the
'crypt' line - if you want to keep the password secret, delete that hex string
before attaching the output here!
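For anyone attaching that output later, here is a minimal redaction sketch (GNU sed assumed; `redact` is a hypothetical helper name, not part of dmsetup) that blanks any long hex run such as the key on the 'crypt' line:

```shell
# Hypothetical redaction filter: replace any run of 32+ hex characters
# (e.g. the dm-crypt key) before pasting 'dmsetup table' output.
# In practice: dmsetup table | redact
redact() { sed -E 's/\b[0-9a-f]{32,}\b/XXXX-REDACTED-XXXX/g'; }

# Demonstration on a sample line shaped like the output in this bug:
echo 'volume1: 0 118800612 crypt aes-cbc-essiv:sha256 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff 0 22:1 0' | redact
# -> volume1: 0 118800612 crypt aes-cbc-essiv:sha256 XXXX-REDACTED-XXXX 0 22:1 0
```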
Comment 7 Alasdair G Kergon 2006-01-25 07:34:27 UTC
Also please tell us any mount or filesystem options you're using.
(eg if it's ext3, journal=? data=? etc.)
Comment 8 Brian Daniels 2006-01-25 07:42:39 UTC
# dmsetup info -c
Name             Maj Min Stat Open Targ Event  UUID
volume1          253   0 L--w    1    1      0

# dmsetup table
volume1: 0 118800612 crypt aes-cbc-essiv:sha256 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0 22:1 0

# dmsetup status
volume1: 0 118800612 crypt

Hex string replaced with X's - thanks for that caution!
Filesystems on the main SATA hd's are ext3.  The backup IDE hd is ext2 running
over an encrypted device.
Comment 9 Brian Daniels 2006-01-25 07:45:44 UTC
With everything mounted:
$mount
/dev/md1 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/md0 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/sdb1 on /mnt/scratch type vfat
(rw,noexec,nosuid,nodev,uid=500,gid=500,umask=000)
/dev/sda1 on /mnt/XP type ntfs (ro,noexec,nosuid,nodev,uid=500,gid=500)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/dev/mapper/volume1 on /mnt/crypto type ext2 (rw)
Comment 10 Brian Daniels 2006-01-27 11:02:04 UTC
Don't know if this will make any difference to your troubleshooting, but I tried
running the backup with 'taskset -c 1' to force processor affinity to CPU 1
only, and verified that all the tar processes were showing that CPU mask.  It
still failed with the same OOPS.
Comment 11 Brian Daniels 2006-01-30 12:01:05 UTC
Ok, scratch the bit about IDE being involved.  I purchased an IDE-to-SATA
converter and moved my IDE backup drive onto it.  Now all drives were connected
to the SATA bus.  I then ran a backup, and Oopsed out once more.

So, the problem is reproducible with no IDE involvement.

I re-verified that running a backup in non-SMP mode works.  It completed
successfully, but had one error:

hotblue kernel: hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
hotblue kernel: hdc: drive_cmd: error=0x04 { DriveStatusError }

So I'm wondering if the issue might only occur when the drive reports an error?
That could explain the seeming randomness of the Oopses.
Comment 12 Brian Daniels 2006-02-03 06:01:03 UTC
Scratch that last.  The hdc error was after backup completion, when setting the
spindown parameter on the drive with hdparm.  The Oopses have been happening
well before that point is reached.
Comment 13 Fredrik Ohrn 2007-01-30 14:58:11 UTC
I've just stumbled upon this oops as well, in a plain v2.6.19.2 kernel (no 
distro patches or nVidia binary drivers). My hardware is an Asus M2N-E 
motherboard (nForce MCP55 chipset) with an AMD 64 X2 5200+ CPU.

It happened when I was doing some benchmarking with iozone and is 100% 
reproducible if I boot with mem=128M; if I boot with mem=512M or higher, the 
oops goes away.

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff8022adcb>] blk_recount_segments+0x79/0x3c1
PGD 46a2067 PUD 6bbd067 PMD 0 
Oops: 0000 [1] SMP 
CPU 0 
Modules linked in: it87 hwmon_vid hwmon eeprom i2c_isa usbhid ohci_hcd ehci_hcd 
usbcore i2c_nforce2 snd_hda_intel snd_hda_codec snd_pcm_oss
 snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
Pid: 399, comm: kcryptd/0 Not tainted 2.6.19.2 #1
RIP: 0010:[<ffffffff8022adcb>]  [<ffffffff8022adcb>] blk_recount_segments+0x79/
0x3c1
RSP: 0018:ffff810007d73d18  EFLAGS: 00010246
RAX: 000000000000001c RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8100063b3740 RDI: ffff810007d02cc8
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000286
R10: ffff810001963800 R11: ffffffff804a7cd8 R12: ffff8100067ccd40
R13: 0000000000000062 R14: ffff810007d02cc8 R15: 0000000000000000
FS:  00002b9ce609a520(0000) GS:ffffffff80508000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000006806000 CR4: 00000000000006e0
Process kcryptd/0 (pid: 399, threadinfo ffff810007d72000, task ffff810007564100)
Stack:  001c810007564100 0000000000000001 0000000000000001 0000000000000000
 ffff8100063b3740 ffff810007d02cc8 ffff8100067ccd40 0000000000000062
 0000000000000620 0000000000080000 ffffffff802bc2cf ffff8100063b3740
Call Trace:
 [<ffffffff802bc2cf>] __bio_clone+0x78/0x8e
 [<ffffffff803e9fe2>] kcryptd_do_work+0x1d6/0x35f
 [<ffffffff8022b296>] __wake_up+0x35/0x4b
 [<ffffffff803e9e0c>] kcryptd_do_work+0x0/0x35f
 [<ffffffff80248d4d>] run_workqueue+0x9a/0xe1
 [<ffffffff80245974>] worker_thread+0x0/0x12e
 [<ffffffff8028c299>] keventd_create_kthread+0x0/0x5f
 [<ffffffff80245a70>] worker_thread+0xfc/0x12e
 [<ffffffff8027bf2f>] default_wake_function+0x0/0xe
 [<ffffffff8028c299>] keventd_create_kthread+0x0/0x5f
 [<ffffffff8027bf2f>] default_wake_function+0x0/0xe
 [<ffffffff8028c299>] keventd_create_kthread+0x0/0x5f
 [<ffffffff8022ef15>] kthread+0xc8/0xf1
 [<ffffffff80259ed8>] child_rip+0xa/0x12
 [<ffffffff8028c299>] keventd_create_kthread+0x0/0x5f
 [<ffffffff8022ee4d>] kthread+0x0/0xf1
 [<ffffffff80259ece>] child_rip+0x0/0x12


Code: 49 8b 10 48 c1 ea 33 48 89 d0 48 c1 e8 09 48 8b 04 c5 40 00 
RIP  [<ffffffff8022adcb>] blk_recount_segments+0x79/0x3c1
 RSP <ffff810007d73d18>
CR2: 0000000000000000

To reproduce the oops, run this script (vg1 should be an LVM2 volume group with 2 
SATA drives):

#!/bin/bash

function run-test ()
{
        echo Testing stripe size $1, record size $2

        file=/root/result.stripe$1.record$2.$3.txt

# Setup
        lvcreate -i 2 -I $1 -l 17652 -n a1 vg1

        dev=/dev/mapper/vg1-a1
        size=`blockdev --getsize $dev`
        echo 0 $size crypt $3 aab0c395bf53747532444025109c0387 0 $dev 0 | dmsetup create vg1-a1-crypt

        dev=/dev/mapper/vg1-a1-crypt

        mkfs.xfs -f $dev > $file
        mount $dev /root/bench
        cd /root/bench

# Test
        /usr/local/bin/iozone -i 0 -C -l 10 -L 64 -r $2 -s 100m -S 64 -t 10 -u 10 -x >> $file

# Cleanup
        cd /root
        umount /root/bench
        dmsetup remove $dev
        lvchange -a n vg1
        lvremove vg1/a1
}

run-test 64 4 aes-cbc-plain
Comment 14 Andrew Morton 2007-01-30 16:01:36 UTC
Add Jens & Christophe - dm-crypt and block are implicated...
Comment 15 Fredrik Ohrn 2007-02-01 11:44:36 UTC
I forgot to mention: when running the benchmark on the striped disks without encryption, it works just fine, so dm-crypt is definitely an important part of the equation.

If there is any additional testing I can do to narrow things down, I'd be happy to help; however, the machine will be put into production on Monday, so I would need instructions on what to do before that.
Comment 16 Fredrik Ohrn 2007-02-06 08:03:26 UTC
Turns out the oops in blk_recount_segments appears in the production 
environment as well, though the callstack is different compared to when I did 
the benchmarking.

This may be because I had to go back to 2.6.18.6 (it turned out that networking 
is broken in 2.6.19 and 2.6.20 when running at gigabit speed).

Another change is that we removed the raid0 striping and now run dm-crypt 
directly on top of a single disk.

It oopsed several times within 5 minutes when copying a lot of files to the 
disk, and got the same message each time:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff8022b5fc>] blk_recount_segments+0x79/0x3c1
PGD 749f6067 PUD 733eb067 PMD 0 
Oops: 0000 [1] SMP 
CPU 0 
Modules linked in: vfat fat it87 hwmon_vid hwmon eeprom i2c_isa ohci_hcd 
ehci_hcd usbcore i2c_nforce2 pdc202xx_old snd_hda_intel snd_hda_codec 
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc 
forcedeth floppy dm_zero dm_mirror
Pid: 7179, comm: pdflush Not tainted 2.6.18.6 #1
RIP: 0010:[<ffffffff8022b5fc>]  [<ffffffff8022b5fc>] blk_recount_segments+0x79/
0x3c1
RSP: 0018:ffff810022a3d738  EFLAGS: 00010246
RAX: 0000000000000036 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff81007100dbc0 RDI: ffff81007ed4b928
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000286
R10: ffff81000d9a9800 R11: ffffffff804a31d8 R12: 0000000000000036
R13: 0000000000000360 R14: ffff81007ed4b928 R15: 0000000000000000
FS:  00002b69e0b92dc0(0000) GS:ffffffff80538000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000073d44000 CR4: 00000000000006e0
Process pdflush (pid: 7179, threadinfo ffff810022a3c000, task ffff81007c6ad760)
Stack:  0036810022a3d780 0000000000000001 0000000000000001 0000000000000000
 ffff81007100dbc0 ffff81007ed4b928 0000000000000036 0000000000000360
 0000000000080000 ffff810070f570a8 ffffffff802ad129 ffff81007100dbc0
Call Trace:
 [<ffffffff802ad129>] __bio_clone+0x78/0x8e
 [<ffffffff802ad171>] bio_clone+0x32/0x39
 [<ffffffff803d9d0c>] crypt_map+0xef/0x330
 [<ffffffff803d383a>] __map_bio+0x47/0xf3
 [<ffffffff803d3923>] clone_bio+0x25/0x47
 [<ffffffff803d3ae2>] __split_bio+0x19d/0x3ca
 [<ffffffff8026064d>] __down_read+0x12/0x9a
 [<ffffffff803d3e19>] dm_request+0x10a/0x119
 [<ffffffff80219e30>] generic_make_request+0x1cd/0x1e4
 [<ffffffff8025ef50>] thread_return+0x0/0xdd
 [<ffffffff802206c4>] mempool_alloc+0x2a/0xe2
 [<ffffffff80230863>] submit_bio+0xd5/0xe0
 [<ffffffff802188c3>] bio_alloc_bioset+0xc6/0x12a
 [<ffffffff8032f638>] xfs_submit_ioend_bio+0x1e/0x27
 [<ffffffff803303c6>] xfs_page_state_convert+0xa64/0xb22
 [<ffffffff80219e30>] generic_make_request+0x1cd/0x1e4
 [<ffffffff8033052b>] xfs_vm_writepage+0xa7/0xde
 [<ffffffff8021aaf1>] mpage_writepages+0x1bb/0x34a
 [<ffffffff80330484>] xfs_vm_writepage+0x0/0xde
 [<ffffffff802579c4>] do_writepages+0x20/0x2f
 [<ffffffff8022d35b>] __writeback_single_inode+0x1c9/0x397
 [<ffffffff803d5a21>] dm_table_any_congested+0x12/0x5f
 [<ffffffff803d3ebe>] dm_any_congested+0x38/0x3f
 [<ffffffff803d5a21>] dm_table_any_congested+0x12/0x5f
 [<ffffffff8021e47a>] sync_sb_inodes+0x1d0/0x2a5
 [<ffffffff8028dbc8>] keventd_create_kthread+0x0/0x5f
 [<ffffffff8024e073>] writeback_inodes+0x85/0xd3
 [<ffffffff8029e141>] background_writeout+0x78/0xb2
 [<ffffffff802530ab>] pdflush+0x13a/0x1e4
 [<ffffffff8029e0c9>] background_writeout+0x0/0xb2
 [<ffffffff80252f71>] pdflush+0x0/0x1e4
 [<ffffffff8022fdbf>] kthread+0xc8/0xf1
 [<ffffffff8025b490>] child_rip+0xa/0x12
 [<ffffffff8028dbc8>] keventd_create_kthread+0x0/0x5f
 [<ffffffff8022fcf7>] kthread+0x0/0xf1
 [<ffffffff8025b486>] child_rip+0x0/0x12


Code: 49 8b 10 48 c1 ea 33 48 89 d0 48 c1 e8 09 48 8b 04 c5 80 f4 
RIP  [<ffffffff8022b5fc>] blk_recount_segments+0x79/0x3c1
 RSP <ffff810022a3d738>
CR2: 0000000000000000
Comment 17 Fredrik Ohrn 2007-02-10 07:16:48 UTC
Inspired by the following thread on LKML: http://www.ussg.iu.edu/hypermail/linux/kernel/0603.3/0826.html I tried using ext3 instead of XFS, and now the oopses are gone; the box has been running stable for days.

Let the finger pointing begin! ;)
Comment 18 Olaf Kirch 2007-03-08 08:21:47 UTC
Does this problem still happen with 2.6.20?
Comment 19 Olaf Kirch 2007-03-08 23:12:54 UTC
All oopses show a crash somewhere deep in crypt_map -> bio_clone ->
blk_recount_segments

Here's what I think is happening:

 -	this happens with write requests
 -	crypt_map submits some io but doesn't consume the whole bio.
	the partial bio is put into io->first_clone
 -	the drive is congested and we call blk_congestion_wait
 -	I/O completes, and the block layer calls crypt_endio
 -	crypt_endio trashes the bio_vec by calling crypt_free_buffer_pages
 -	crypt_map resumes and tries to submit the remainder of the I/O,
	and ends up trying to call bio_clone(io->first_clone)
 -	this crashes because bio_vec[]->bv_page is now NULL

This is an unconfirmed theory.

Now a question: why does the code keep a reference to the first bio in
first_clone? This seems to go back to some gray medieval past when bio_clone
would create bios that shared the bio_vec with the original bio. At least
a comment in the code suggests that this is the reason. But this is certainly
not true for current kernels.

So could this simply be fixed by getting rid of first_clone and always
allocating a fresh bio in crypt_alloc_buffer?
Comment 20 Fredrik Ohrn 2007-03-09 09:48:52 UTC
I've converted all the filesystems to ext3 so I can't easily try with 2.6.20. When I get some spare time I'll see if I can find and connect an extra disk to test with.
Comment 21 Olaf Kirch 2007-03-20 12:43:34 UTC
Seems I can confirm my theory. I was able to reproduce this on 2.6.20
with a little extra help. I just made crypt_alloc_buffer return after
allocating 4 pages (to simulate memory pressure, which is what's happening
in this bug), and this triggered an oops within seconds (use a file
system that will create write bios of more than one page, e.g. ext2).

@@ -406,6 +406,7 @@ crypt_alloc_buffer(struct crypt_io *io,
        for (i = clone->bi_idx; i < nr_iovecs; i++) {
                struct bio_vec *bv = bio_iovec_idx(clone, i);

+if (i - clone->bi_idx > 4) break;
                bv->bv_page = mempool_alloc(cc->page_pool, gfp_mask);
                if (!bv->bv_page)
                        break;

It oopsed either in bio_clone, or in bio_put, on a corrupted bio.

I have several patches which I will post to dm-devel for review.
Comment 22 Natalie Protasevich 2007-06-21 13:46:00 UTC
Brian, Fredrik,
Could you please test with the current kernel (2.6.22-rc5)?
https://www.redhat.com/archives/dm-devel/2007-March/msg00041.html
It looks like patches were submitted and merged.
Thanks.
Comment 23 Alasdair G Kergon 2007-11-13 04:35:24 UTC
Assuming this is now resolved.
