Bug 9718 - ext3 oops on unmount of corrupted image
ext3 oops on unmount of corrupted image
Status: CLOSED CODE_FIX
Product: File System
Classification: Unclassified
Component: ext3
All Linux
: P1 normal
Assigned To: Andrew Morton
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-01-09 07:34 UTC by Eric Sesterhenn
Modified: 2008-01-16 13:37 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.24-rc7-gcdf71a10
Tree: Mainline
Regression: ---


Attachments
Image before and after testing (291.12 KB, application/x-bzip)
2008-01-09 08:36 UTC, Eric Sesterhenn
Details

Description Eric Sesterhenn 2008-01-09 07:34:26 UTC
Latest working kernel version: -
Earliest failing kernel version: 2.6.24-rc7-gcdf71a10
Distribution: Gentoo
Hardware Environment: Pentium III (Coppermine)
Software Environment:

Linux snakebyte-desktop 2.6.24-rc7-gcdf71a10 #29 PREEMPT Wed Jan 9 16:03:37 CET 2008 i686 GNU/Linux
 
Gnu C                  4.1.3
Gnu make               3.81
binutils               2.18
util-linux             2.13
mount                  2.13
module-init-tools      3.3-pre2
e2fsprogs              1.40.2
reiserfsprogs          3.6.19
reiser4progs           1.0.6
xfsprogs               2.9.0
pcmciautils            014
PPP                    2.4.4
Linux C Library        2.6.1
Dynamic linker (ldd)   2.6.1
Procps                 3.2.7
Net-tools              1.60
Console-tools          0.2.3
Sh-utils               5.97
udev                   113


Problem Description:

I added fsx-linux from ltp and fsstress to the fsfuzzer tests, now during unmount i sometimes get the following oops: 

[ 4719.741202] BUG: unable to handle kernel paging request at virtual address fffffffa
[ 4719.741261] printing eip: c013ece3 *pde = 0087f067 *pte = 00000000 
[ 4719.741409] Oops: 0000 [#1] PREEMPT 
[ 4719.741524] Modules linked in:
[ 4719.741563] 
[ 4719.741586] Pid: 12258, comm: umount Not tainted (2.6.24-rc7-g3ce54450 #28)
[ 4719.741669] EIP: 0060:[<c013ece3>] EFLAGS: 00010046 CPU: 0
[ 4719.741718] EIP is at __lock_acquire+0x1d3/0x1050
[ 4719.741741] EAX: 00000002 EBX: 00000246 ECX: fffffff6 EDX: 00000002
[ 4719.741765] ESI: 00000000 EDI: 00000000 EBP: cca78da4 ESP: cca78d3c
[ 4719.741848]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[ 4719.741872] Process umount (pid: 12258, ti=cca78000 task=cf208000 task.ti=cca78000)
[ 4719.741896] Stack: cf43e0f0 c11e87c0 cf96bf00 cf43e000 cf96bf00 cf43e0f0 cca78d80 c016ce3f 
[ 4719.742248]        00000002 00000000 fffffff6 cf208000 cf208000 cca78d88 c013e453 00000000 
[ 4719.742542]        cf208000 c016eb25 00000287 cca78da8 c013e5f3 cf43e0f0 00000046 00000246 
journal_force_commit+0xd/0x30
[ 4719.743648]  [<c01faaf5>] ext3_force_commit+0x25/0x30
[ 4719.743704]  [<c01f4b7c>] ext3_write_inode+0x2c/0x40
[ 4719.743805]  [<c018dfd6>] __writeback_single_inode+0x266/0x330
[ 4719.743854]  [<c018e4b8>] sync_sb_inodes+0x1c8/0x290
[ 4719.743955]  [<c018e5fa>] sync_inodes_sb+0x7a/0x90
[ 4719.743997]  [<c0173711>] __fsync_super+0x61/0x70
[ 4719.744113]  [<c017372b>] fsync_super+0xb/0x20
[ 4719.744157]  [<c0173942>] generic_shutdown_super+0x22/0xd0
[ 4719.744258]  [<c01739ff>] kill_block_super+0xf/0x20
[ 4719.744300]  [<c0173ab7>] deactivate_super+0x47/0x60
[ 4719.744401]  [<c0187914>] mntput_no_expire+0x44/0x60
[ 4719.744457]  [<c0178c45>] path_release_on_umount+0x15/0x20
[ 4719.744565]  [<c0187f01>] sys_umount+0x41/0x210
[ 4719.744609]  [<c01880e7>] sys_oldumount+0x17/0x20
[ 4719.744710]  [<c0104272>] sysenter_past_esp+0x5f/0xa5
[ 4719.744754]  =======================
[ 4719.744775] Code: d3 37 75 c0 e8 6f 09 fe ff c7 04 24 7c 5a 75 c0 e8 c3 8d 74 26 00 8b 4d c0 <8b> 71 04 85 f6 0f 84 78 fe ff ff ff 86 c0 00 00 00 89 f0 e8 05 
[ 4719.746918] EIP: [<c013ece3>] __lock_acquire+0x1d3/0x1050 SS:ESP 0068:cca78d3c
[ 4719.747041] ---[ end trace a415aae57fb6632a ]---


So far i was unable to reproduce this only with fsx or fstress on the corrupted filesystem, havent seen this on ext2 so far, will try with ext4
Comment 1 Eric Sandeen 2008-01-09 08:27:46 UTC
Eric, if you'd like to send me the image, I'll try to find time to take a look.

Thanks,

-Eric
Comment 2 Eric Sesterhenn 2008-01-09 08:36:08 UTC
Created attachment 14380 [details]
Image before and after testing

the .img is the image after the run of run_test, the .img.bck is the one which is not modified yet. I only added the following lines to run_test, with snakebyte being a local user since it once happened to me that one of both tests went wild and deleted the entire file instead of only the directory

mkdir /media/test/stress
chown snakebyte:snakebyte /media/test/stress
echo "+++ fsstress..."
sudo -u snakebyte fsstress -c -d /media/test/stress -l 20
echo "+++ fsx..."
sudo -u snakebyte fsx -N 100 /media/test/stress/test
Comment 3 Eric Sandeen 2008-01-15 20:22:29 UTC
Eric, which version of fsfuzzer are you using; did you add the fsstress/fsx tests at the end?

looks like maybe we missed an ERR_PTR in there somewhere, dereferencing fffffffa...

do you know where __lock_acquire+0x1d3/0x1050 lands in your build, or can you attach some disassembly here?

Thanks,

-Eric
Comment 4 Eric Sesterhenn 2008-01-16 04:20:38 UTC
Hi, i am basically using fsfuzzer-0.6-lmh with some modifications. To reduce the testcase, in the run_test file,  the perform_operations() function only contains the following now:

perform_operations() {
  # perform basic operations
  mkdir /media/test/stress
  chown snakebyte:snakebyte /media/test/stress
  echo "+++ fsx..."
  sudo -u snakebyte fsx -N 100 /media/test/stress/test
}


I changed the fuzzer to mount the ext3 file system with -o debug, which gave some additional output;


[  408.387190] EXT3-fs error (device loop0): ext3_new_block: Allocating block in system zone - blocks from 11, length 1
[  408.388489] EXT3-fs error (device loop0): ext3_new_block: Allocating block in system zone - blocks from 15, length 1
[  408.404410] EXT3-fs error (device loop0): ext3_free_blocks: Freeing blocks in system zones - Block = 15, count = 1
[  408.406595] EXT3-fs error (device loop0): ext3_free_blocks: Freeing blocks in system zones - Block = 11, count = 1
[  408.502057] Aborting journal on device loop0.
[  408.504328] journal commit I/O error
[  408.506609] BUG: unable to handle kernel paging request at virtual address fffffffa
[  408.506682] printing eip: c013ece3 *pde = 0087f067 *pte = 00000000 
[  408.506832] Oops: 0000 [#1] PREEMPT 
[  408.506947] Modules linked in:
[  408.506986] 
[  408.507010] Pid: 3920, comm: umount Not tainted (2.6.24-rc7-gcdf71a10 #29)
[  408.507092] EIP: 0060:[<c013ece3>] EFLAGS: 00010046 CPU: 0
[  408.507140] EIP is at __lock_acquire+0x1d3/0x1050
[  408.507164] EAX: 00000002 EBX: 00000246 ECX: fffffff6 EDX: 00000002
[  408.507188] ESI: 00000000 EDI: 00000000 EBP: cf32cda4 ESP: cf32cd3c
[  408.507270]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[  408.507295] Process umount (pid: 3920, ti=cf32c000 task=c23a4000 task.ti=cf32c000)
[  408.507318] Stack: cf785000 c11ef0a0 cf96af00 cf785050 cf96af00 cf785000 cf32cd80 c016cebf 
[  408.507674]        00000002 00000000 fffffff6 c23a4000 c23a4000 cf32cd88 c013e453 00000000 
[  408.507970]        c23a4000 c016eba5 00000287 cf32cda8 c013e5f3 cf785000 00000046 00000246 
journal_force_commit+0xd/0x30
[  408.509085]  [<c01fab75>] ext3_force_commit+0x25/0x30
[  408.509139]  [<c01f4bfc>] ext3_write_inode+0x2c/0x40
[  408.509241]  [<c018e056>] __writeback_single_inode+0x266/0x330
[  408.509291]  [<c018e538>] sync_sb_inodes+0x1c8/0x290
[  408.509392]  [<c018e67a>] sync_inodes_sb+0x7a/0x90
[  408.509435]  [<c0173791>] __fsync_super+0x61/0x70
[  408.509550]  [<c01737ab>] fsync_super+0xb/0x20
[  408.509593]  [<c01739c2>] generic_shutdown_super+0x22/0xd0
[  408.509695]  [<c0173a7f>] kill_block_super+0xf/0x20
[  408.509737]  [<c0173b37>] deactivate_super+0x47/0x60
[  408.509838]  [<c0187994>] mntput_no_expire+0x44/0x60
[  408.509893]  [<c0178cc5>] path_release_on_umount+0x15/0x20
[  408.510003]  [<c0187f81>] sys_umount+0x41/0x210
[  408.510046]  [<c0188167>] sys_oldumount+0x17/0x20
[  408.510148]  [<c0104272>] sysenter_past_esp+0x5f/0xa5
[  408.510191]  =======================
[  408.510212] Code: c3 37 75 c0 e8 6f 09 fe ff c7 04 24 74 5a 75 c0 e8 c3 8d 74 26 00 8b 4d c0 <8b> 71 04 85 f6 0f 84 78 fe ff ff ff 86 c0 00 00 00 89 f0 e8 05 
[  408.512372] EIP: [<c013ece3>] __lock_acquire+0x1d3/0x1050 SS:ESP 0068:cf32cd3c
[  408.512495] ---[ end trace d536df78a3986ddd ]---


The oops seems to happen in kernel/lockdep.c at:

static int __lock_acquire(...) 
...
        if (unlikely(subclass >= MAX_LOCKDEP_SUBCLASSES)) {
                debug_locks_off();
                printk("BUG: MAX_LOCKDEP_SUBCLASSES too low!\n");
                printk("turning off the locking correctness validator.\n");
                return 0;
        }

        if (!subclass)
                class = lock->class_cache;   <--- ### here ###
        /*
         * Not cached yet or subclass?
         */



Comment 5 Eric Sandeen 2008-01-16 07:55:50 UTC
Great, thanks for the extra info.
Comment 6 Jonas Bonn 2008-01-16 13:15:00 UTC
I submitted a patch containing the following to linux-ext4 mailing list yesterday... it's in -mm already.  The above oops is caused by __lock_acquire being called even though handle was set with ERR_PTR... in the oops above, err=-EROFS.

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 08ff6c7..038ed74 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -288,10 +288,12 @@ handle_t *journal_start(journal_t *journal, int nblocks)
                jbd_free_handle(handle);
                current->journal_info = NULL;
                handle = ERR_PTR(err);
+               goto out;
        }
 
        lock_acquire(&handle->h_lockdep_map, 0, 0, 0, 2, _THIS_IP_);
 
+out:
        return handle;
 }
Comment 7 Eric Sandeen 2008-01-16 13:30:50 UTC
Jonas, thanks.  I saw that patch but failed to put two and two together :)
Comment 8 Eric Sesterhenn 2008-01-16 13:37:24 UTC
thanks, fixes the bug for me too

Note You need to log in before you can comment on or make changes to this bug.