Bug 5948
Summary: | Oops when accessing SATA with device mapper on AMD 64 X2 | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Brian Daniels (bitmage) |
Component: | LVM2/DM | Assignee: | Alasdair G Kergon (agk) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | akpm, axboe, christophe, fredrik.ohrn, okir, protasnb |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.15.1smp | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Brian Daniels
2006-01-24 06:46:31 UTC
Please read the bug filing instructions. No binary modules. Sigh. Restarted system at runlevel 3, no nvidia modules. Re-ran backup, 30 min later: Jan 24 11:21:43 hotblue kernel: Unable to handle kernel NULL pointer dereference at 0000000000000007 RIP: Jan 24 11:21:43 hotblue kernel: <ffffffff80220175>{blk_recount_segments+133} Jan 24 11:21:43 hotblue kernel: PGD 3d007067 PUD 33891067 PMD 0 Jan 24 11:21:43 hotblue kernel: Oops: 0000 [1] SMP Jan 24 11:21:43 hotblue kernel: CPU 0 Jan 24 11:21:43 hotblue kernel: Modules linked in: sha256 aes_x86_64 dm_crypt parport_pc lp parport ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables nls_utf8 ntfs vfat fat video button battery ac powermate ohci1394 ieee1394 ohci_hcd ehci_hcd i2c_nforce2 i2c_core snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc r8169 forcedeth floppy dm_mod sata_nv libata sd_mod scsi_mod Jan 24 11:21:43 hotblue kernel: Pid: 4246, comm: tar Not tainted 2.6.15.1smp #1 Jan 24 11:21:43 hotblue kernel: RIP: 0010:[<ffffffff80220175>] <ffffffff80220175>{blk_recount_segments+133} Jan 24 11:21:43 hotblue kernel: RSP: 0018:ffff810033f7f4c8 EFLAGS: 00010202 Jan 24 11:21:43 hotblue kernel: RAX: 0000000000000000 RBX: 8e38e38e38e38e39 RCX: 000000000000000d Jan 24 11:21:43 hotblue kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff81003f867108 Jan 24 11:21:43 hotblue kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 Jan 24 11:21:43 hotblue kernel: R10: ffff810016ee4000 R11: 0000000000000000 R12: ffff81001d24f5c0 Jan 24 11:21:43 hotblue kernel: R13: 0000000000020000 R14: 00000000000001e0 R15: 0000000000000000 Jan 24 11:21:44 hotblue kernel: FS: 00002aaaaaad5b00(0000) GS:ffffffff8053c800(0000) knlGS:0000000000000000 Jan 24 11:21:44 hotblue kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jan 24 11:21:44 hotblue kernel: CR2: 0000000000000007 CR3: 000000003392b000 CR4: 00000000000006e0 Jan 24 11:21:44 hotblue kernel: Process tar (pid: 4246, threadinfo ffff810033f7e000, task ffff81003e436760) Jan 24 11:21:44 hotblue kernel: Stack: 000d000000000048 000000013ffb7fc0 0000000100010010 0000000000000000 Jan 24 11:21:44 hotblue kernel: ffff81003f867108 ffff81001d24f5c0 ffff81000a844540 ffff81003f867108 Jan 24 11:21:44 hotblue kernel: 0000000000020000 00000000000001e0 Jan 24 11:21:44 hotblue kernel: Call Trace:<ffffffff80189c86>{bio_clone+166} <ffffffff8821306e>{:dm_crypt:crypt_map+238} Jan 24 11:21:44 hotblue kernel: <ffffffff880472a1>{:dm_mod:__map_bio+81} <ffffffff88047629>{:dm_mod:__split_bio+361} Jan 24 11:21:44 hotblue kernel: <ffffffff80379541>{__down_read+49} <ffffffff88047935>{:dm_mod:dm_request+261} Jan 24 11:21:44 hotblue kernel: <ffffffff80222829>{generic_make_request+361} <ffffffff8018997e>{bio_alloc_bioset+382} Jan 24 11:21:44 hotblue kernel: <ffffffff80222918>{submit_bio+216} <ffffffff801abb52>{mpage_bio_submit+34} Jan 24 11:21:44 hotblue kernel: <ffffffff801ac49c>{__mpage_writepage+860} <ffffffff801eceb0>{ext2_get_block+0} Jan 24 11:21:44 hotblue kernel: <ffffffff80379c09>{_read_unlock_irq+9} <ffffffff8015e9ae>{find_get_pages_tag+158} Jan 24 11:21:44 hotblue kernel: <ffffffff801ac8c3>{mpage_writepages+579} <ffffffff801ed4d0>{ext2_writepage+0} Jan 24 11:21:44 hotblue kernel: <ffffffff801eceb0>{ext2_get_block+0} <ffffffff801aaf5b>{__writeback_single_inode+571} Jan 24 11:21:44 hotblue kernel: <ffffffff801ecfb5>{ext2_get_block+261} <ffffffff801ab674>{sync_sb_inodes+500} Jan 24 11:21:44 hotblue kernel: <ffffffff801ab9d6>{writeback_inodes+166} <ffffffff80164ae4>{balance_dirty_pages_ratelimited+228} Jan 24 11:21:44 hotblue kernel: <ffffffff801602b5>{generic_file_buffered_write+1285} Jan 24 11:21:44 hotblue kernel: <ffffffff8013a1e5>{current_fs_time+85} <ffffffff801a049e>{inode_update_time+62} Jan 24 11:21:44 hotblue kernel: <ffffffff801609aa>{__generic_file_aio_write_nolock+938} Jan 24 11:21:44 hotblue kernel: <ffffffff8012d8ab>{try_to_wake_up+1243} <ffffffff80160a75>{__generic_file_write_nolock+149} Jan 24 11:21:44 hotblue kernel: <ffffffff8012f323>{__wake_up+67} <ffffffff80191c68>{pipe_readv+840} Jan 24 11:21:44 hotblue kernel: <ffffffff8014b9b0>{autoremove_wake_function+0} <ffffffff80160c0e>{generic_file_write+126} Jan 24 11:21:44 hotblue kernel: <ffffffff801847f9>{vfs_write+233} <ffffffff80184993>{sys_write+83} Jan 24 11:21:44 hotblue kernel: <ffffffff8010dc56>{system_call+126} Jan 24 11:21:44 hotblue kernel: Jan 24 11:21:44 hotblue kernel: Code: 49 0f b6 41 07 4c 89 c9 48 8b 04 c5 40 5e 54 80 48 2b 88 20 Jan 24 11:21:44 hotblue kernel: RIP <ffffffff80220175>{blk_recount_segments+133} RSP <ffff810033f7f4c8> Jan 24 11:21:44 hotblue kernel: CR2: 0000000000000007 looks like device-mapper or block layer problem... It's possibly a devicemapper bug. Are you able to reproduce it on a system which has never had the nvidia module loaded? I only have the one system. If you mean can I repro it on a clean boot never loading that module, yes. If you mean literally never having had the nvidia code on the system since install, it's too late for me. Is there something I can clear or remove to test the devicemapper issue? Please can you tell us your device-mapper configuration by running: dmsetup info -c dmsetup table dmsetup status NB Your password may appear coded into a long hex string in the output on the 'crypt' line - if you want to keep the password secret, delete that hex string before attaching the output here! Also please tell us any mount or filesystem options you're using. (eg if it's ext3, journal=? data=? etc.) # dmsetup info -c Name Maj Min Stat Open Targ Event UUID volume1 253 0 L--w 1 1 0 # dmsetup table volume1: 0 118800612 crypt aes-cbc-essiv:sha256 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0 22:1 0 # dmsetup status volume1: 0 118800612 crypt Hex string replaced with X's - thanks for that caution! Filesystems on the main SATA hd's are ext3. The backup IDE hd is ext2 running over an encrypted device. With everything mounted: $mount /dev/md1 on / type ext3 (rw) none on /proc type proc (rw) none on /sys type sysfs (rw) none on /dev/pts type devpts (rw,gid=5,mode=620) usbfs on /proc/bus/usb type usbfs (rw) /dev/md0 on /boot type ext3 (rw) none on /dev/shm type tmpfs (rw) /dev/sdb1 on /mnt/scratch type vfat (rw,noexec,nosuid,nodev,uid=500,gid=500,umask=000) /dev/sda1 on /mnt/XP type ntfs (ro,noexec,nosuid,nodev,uid=500,gid=500) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) /dev/mapper/volume1 on /mnt/crypto type ext2 (rw) Don't know if this will make any difference to your troubleshooting, but I tried running the backup with 'taskset -c 1' to force processor affinity to CPU 1 only, and verified that all the tar processes were showing that CPU mask. It still failed with the same OOPS. Ok, scratch the bit about IDE being involved. I purchased an IDE-to-SATA converter and moved my IDE backup drive onto it. Now all drives were connected to the SATA bus. I then ran a backup, and Oopsed out once more. So, the problem is reproducible with no IDE involvement. I re-verified that running a backup in non-SMP mode works. It completed sucessfully, but had one error: hotblue kernel: hdc: drive_cmd: status=0x51 { DriveReady SeekComplete Error } hotblue kernel: hdc: drive_cmd: error=0x04 { DriveStatusError } So I'm wondering if the issue might only occur when the drive reports an error? That could explain the seeming randomness of the Oopses. Scratch that last. The hdc error was after backup completion, when setting the spindown parameter on the drive with hdparm. The Ooopses have been happening well before that point is reached. I've just stumbled upon this oops as well, in a plain v2.6.19.2 kernel (no distro patches or nVidia binary drivers). My hardware is a Asus M2N-E motherboard (nForce MCP55 chipset) with a AMD 64 X2 5200+ CPU. It happened when I was doing some benchmarking with iozone and is 100% reproducible if I boot with mem=128M, if I boot with mem=512M or higher the oops goes away. Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [<ffffffff8022adcb>] blk_recount_segments+0x79/0x3c1 PGD 46a2067 PUD 6bbd067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: it87 hwmon_vid hwmon eeprom i2c_isa usbhid ohci_hcd ehci_hcd usbcore i2c_nforce2 snd_hda_intel snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc Pid: 399, comm: kcryptd/0 Not tainted 2.6.19.2 #1 RIP: 0010:[<ffffffff8022adcb>] [<ffffffff8022adcb>] blk_recount_segments+0x79/ 0x3c1 RSP: 0018:ffff810007d73d18 EFLAGS: 00010246 RAX: 000000000000001c RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8100063b3740 RDI: ffff810007d02cc8 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000286 R10: ffff810001963800 R11: ffffffff804a7cd8 R12: ffff8100067ccd40 R13: 0000000000000062 R14: ffff810007d02cc8 R15: 0000000000000000 FS: 00002b9ce609a520(0000) GS:ffffffff80508000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000006806000 CR4: 00000000000006e0 Process kcryptd/0 (pid: 399, threadinfo ffff810007d72000, task ffff810007564100) Stack: 001c810007564100 0000000000000001 0000000000000001 0000000000000000 ffff8100063b3740 ffff810007d02cc8 ffff8100067ccd40 0000000000000062 0000000000000620 0000000000080000 ffffffff802bc2cf ffff8100063b3740 Call Trace: [<ffffffff802bc2cf>] __bio_clone+0x78/0x8e [<ffffffff803e9fe2>] kcryptd_do_work+0x1d6/0x35f [<ffffffff8022b296>] __wake_up+0x35/0x4b [<ffffffff803e9e0c>] kcryptd_do_work+0x0/0x35f [<ffffffff80248d4d>] run_workqueue+0x9a/0xe1 [<ffffffff80245974>] worker_thread+0x0/0x12e [<ffffffff8028c299>] keventd_create_kthread+0x0/0x5f [<ffffffff80245a70>] worker_thread+0xfc/0x12e [<ffffffff8027bf2f>] default_wake_function+0x0/0xe [<ffffffff8028c299>] keventd_create_kthread+0x0/0x5f [<ffffffff8027bf2f>] default_wake_function+0x0/0xe [<ffffffff8028c299>] keventd_create_kthread+0x0/0x5f [<ffffffff8022ef15>] kthread+0xc8/0xf1 [<ffffffff80259ed8>] child_rip+0xa/0x12 [<ffffffff8028c299>] keventd_create_kthread+0x0/0x5f [<ffffffff8022ee4d>] kthread+0x0/0xf1 [<ffffffff80259ece>] child_rip+0x0/0x12 Code: 49 8b 10 48 c1 ea 33 48 89 d0 48 c1 e8 09 48 8b 04 c5 40 00 RIP [<ffffffff8022adcb>] blk_recount_segments+0x79/0x3c1 RSP <ffff810007d73d18> CR2: 0000000000000000 To reproduce the oops run this script (vg1 should be a LVM2 volume group with 2 SATA drives): #!/bin/bash function run-test () { echo Testing stripe size $1, record size $2 file=/root/result.stripe$1.record$2.$3.txt # Setup lvcreate -i 2 -I $1 -l 17652 -n a1 vg1 dev=/dev/mapper/vg1-a1 size=`blockdev --getsize $dev` echo 0 $size crypt $3 aab0c395bf53747532444025109c0387 0 $dev 0 | dmsetup create vg1-a1-crypt dev=/dev/mapper/vg1-a1-crypt mkfs.xfs -f $dev > $file mount $dev /root/bench cd /root/bench # Test /usr/local/bin/iozone -i 0 -C -l 10 -L 64 -r $2 -s 100m -S 64 -t 10 -u 10 -x >> $file # Cleanup cd /root umount /root/bench dmsetup remove $dev lvchange -a n vg1 lvremove vg1/a1 } run-test 64 4 aes-cbc-plain Add Jens & Christophe - dm-crypt and block are implicated... I forgot to mention, when running the benchmark on the striped disks without encryption it works just fine, so dm-crypt is definitely an important part of the equation. If there is any additional testing I can do to narrow things down I'd be happy to help, however the machine will be put into production on Monday so I would need instructions on what to do before that. Turns out the oops in blk_recount_segments appears in the production environment as well, the callstack is different compared to when I did the benchmarking though. This may be becasue I had to go back to 2.6.18.6 (it turned out that networking is broken in 2.6.19 and 2.6.20 when running at gigabit speed). Another change is that we removed the raid0 striping and now run dm-crypt directly on top of a single disk. It oopsed several times within 5 minutes when coping a lot of files to the disk, got the same message each time: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [<ffffffff8022b5fc>] blk_recount_segments+0x79/0x3c1 PGD 749f6067 PUD 733eb067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: vfat fat it87 hwmon_vid hwmon eeprom i2c_isa ohci_hcd ehci_hcd usbcore i2c_nforce2 pdc202xx_old snd_hda_intel snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc forcedeth floppy dm_zero dm_mirror Pid: 7179, comm: pdflush Not tainted 2.6.18.6 #1 RIP: 0010:[<ffffffff8022b5fc>] [<ffffffff8022b5fc>] blk_recount_segments+0x79/ 0x3c1 RSP: 0018:ffff810022a3d738 EFLAGS: 00010246 RAX: 0000000000000036 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff81007100dbc0 RDI: ffff81007ed4b928 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000286 R10: ffff81000d9a9800 R11: ffffffff804a31d8 R12: 0000000000000036 R13: 0000000000000360 R14: ffff81007ed4b928 R15: 0000000000000000 FS: 00002b69e0b92dc0(0000) GS:ffffffff80538000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000073d44000 CR4: 00000000000006e0 Process pdflush (pid: 7179, threadinfo ffff810022a3c000, task ffff81007c6ad760) Stack: 0036810022a3d780 0000000000000001 0000000000000001 0000000000000000 ffff81007100dbc0 ffff81007ed4b928 0000000000000036 0000000000000360 0000000000080000 ffff810070f570a8 ffffffff802ad129 ffff81007100dbc0 Call Trace: [<ffffffff802ad129>] __bio_clone+0x78/0x8e [<ffffffff802ad171>] bio_clone+0x32/0x39 [<ffffffff803d9d0c>] crypt_map+0xef/0x330 [<ffffffff803d383a>] __map_bio+0x47/0xf3 [<ffffffff803d3923>] clone_bio+0x25/0x47 [<ffffffff803d3ae2>] __split_bio+0x19d/0x3ca [<ffffffff8026064d>] __down_read+0x12/0x9a [<ffffffff803d3e19>] dm_request+0x10a/0x119 [<ffffffff80219e30>] generic_make_request+0x1cd/0x1e4 [<ffffffff8025ef50>] thread_return+0x0/0xdd [<ffffffff802206c4>] mempool_alloc+0x2a/0xe2 [<ffffffff80230863>] submit_bio+0xd5/0xe0 [<ffffffff802188c3>] bio_alloc_bioset+0xc6/0x12a [<ffffffff8032f638>] xfs_submit_ioend_bio+0x1e/0x27 [<ffffffff803303c6>] xfs_page_state_convert+0xa64/0xb22 [<ffffffff80219e30>] generic_make_request+0x1cd/0x1e4 [<ffffffff8033052b>] xfs_vm_writepage+0xa7/0xde [<ffffffff8021aaf1>] mpage_writepages+0x1bb/0x34a [<ffffffff80330484>] xfs_vm_writepage+0x0/0xde [<ffffffff802579c4>] do_writepages+0x20/0x2f [<ffffffff8022d35b>] __writeback_single_inode+0x1c9/0x397 [<ffffffff803d5a21>] dm_table_any_congested+0x12/0x5f [<ffffffff803d3ebe>] dm_any_congested+0x38/0x3f [<ffffffff803d5a21>] dm_table_any_congested+0x12/0x5f [<ffffffff8021e47a>] sync_sb_inodes+0x1d0/0x2a5 [<ffffffff8028dbc8>] keventd_create_kthread+0x0/0x5f [<ffffffff8024e073>] writeback_inodes+0x85/0xd3 [<ffffffff8029e141>] background_writeout+0x78/0xb2 [<ffffffff802530ab>] pdflush+0x13a/0x1e4 [<ffffffff8029e0c9>] background_writeout+0x0/0xb2 [<ffffffff80252f71>] pdflush+0x0/0x1e4 [<ffffffff8022fdbf>] kthread+0xc8/0xf1 [<ffffffff8025b490>] child_rip+0xa/0x12 [<ffffffff8028dbc8>] keventd_create_kthread+0x0/0x5f [<ffffffff8022fcf7>] kthread+0x0/0xf1 [<ffffffff8025b486>] child_rip+0x0/0x12 Code: 49 8b 10 48 c1 ea 33 48 89 d0 48 c1 e8 09 48 8b 04 c5 80 f4 RIP [<ffffffff8022b5fc>] blk_recount_segments+0x79/0x3c1 RSP <ffff810022a3d738> CR2: 0000000000000000 Inspired by the following thread on LKML: http://www.ussg.iu.edu/hypermail/linux/kernel/0603.3/0826.html I tried using ext3 instead of XFS and now the oopses are gone, the box has been running stable for days. Let the finger pointing begin! ;) Does this problem still happen with 2.6.20? All oopses show a crash somewhere deep in crypt_map -> bio_clone -> blk_recount_sectors Here's what I think is happening: - this happens with write requests - crypt_map submits some io but doesn't consume the whole bio. the partial bio is put into io->first_clone - the drive is congested and we call blk_congestion_wait - I/O completes, and the block layer calls crypt_endio - crypt_endio trashes the bio_vec by calling crypt_free_buffer_pages - crypt_map resumes and tries to submit the remainder of the I/O, and ends up trying to call bio_blone(io->first_clone) - this crashes because bio_vec[]->bv_page is now NULL This is an unconfirmed theory. Now a question: why does the code keep a reference to the first bio in first_clone? This seems to go back to some gray medieval past when bio_clone would create bios that shared the bio_vec with the original bio. At least a comment in the code suggests that this is the reason. But this is certainly not true for current kernels. So could this simply be fixed by getting rid of first_clone and always allocating a fresh bio in crypt_alloc_buffer? I've converted all the filesystems to ext3 so I can't easily try with 2.6.20. When I get some spare time I'll see if find and connect an extra disk to test with. Seems I can confirm my theory. I was able to reproduce this on 2.6.20 with a little extra help. I just made crypt_alloc_buffer return after allocating 4 pages (to simulate memory pressure, which is what's happening in this bug), and this triggered an oops within seconds (use a file system that will create write bios of more than one page. eg ext2). @@ -406,6 +406,7 @@ crypt_alloc_buffer(struct crypt_io *io, for (i = clone->bi_idx; i < nr_iovecs; i++) { struct bio_vec *bv = bio_iovec_idx(clone, i); +if (i - clone->bi_idx > 4) break; bv->bv_page = mempool_alloc(cc->page_pool, gfp_mask); if (!bv->bv_page) break; It oopsed either in bio_clone, or in bio_put, on a corrupted bio. I have several patches which I will post to dm-devel for review. Brian, Fredrik, Could you please test with the current kernel (2.6.22-rc5). https://www.redhat.com/archives/dm-devel/2007-March/msg00041.html It looks like patches were submitted and merged. Thanks. Assuming this is now resolved. |