Bug 4963
Summary: | Encounter this Kernel oops while running file system race around test suite | ||
---|---|---|---|
Product: | File System | Reporter: | Sharyathi (sharyathi) |
Component: | JFS | Assignee: | Dave Kleikamp (shaggy) |
Status: | REJECTED DUPLICATE | ||
Severity: | normal | CC: | bnpoorni, maneesh, raosanth, sglass |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.13-rc3-mm2 | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Sharyathi
2005-07-29 06:31:10 UTC
Would it be possible to determine where in txLock the trap occurred? At the least, could you "make fs/jfs/jfs_txnmgr.s" so that I can compare the failing code to the assembler listing? I don't have a xeon system to compile it on myself. OK, well there's one known bug in the BIO layer which could cause the dm layer to return bad data. That might explain the ext2 warning, although JFS shouldn't be oopsing. I'd suggest that you add ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.13-rc3/2.6.13-rc3-mm3/broken-out/bio_clone-fix.patch. That's already in 2.6.13-rc4. I had captured the dump and here is the preliminary analysis over the dump. Let me know If you need more information Meanwhile I will test on the rc4-mm1 and let you know, if this bug is fixed ----------------------------------- GNU gdb Red Hat Linux (6.3.0.0-0.31rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1". ----------------------------- #0 0xc01e3655 in journal_unmap_buffer (journal=0xc5f92200, bh=0xdfd95c9c) at include/asm/bitops.h:143 143 __asm__ __volatile__( LOCK_PREFIX ----------------------------- (gdb) info thread 4 process 1122 submit_bh (rw=1, bh=0xe3bc8754) at fs/buffer.c:2752 3 process 0 0xc01e3655 in journal_unmap_buffer (journal=0xc5f92200, bh=0xdfd95c9c) at include/asm/bitops.h:143 2 process 0 0xc01e3655 in journal_unmap_buffer (journal=0xc5f92200, bh=0xdfd95c9c) at include/asm/bitops.h:143 * 1 process 2284 0xc01e3655 in journal_unmap_buffer (journal=0xc5f92200, bh=0xdfd95c9c) at include/asm/bitops.h:143 ----------------------------- (gdb) bt #0 0xc01e3655 in journal_unmap_buffer (journal=0xc5f92200, bh=0xdfd95c9c) at include/asm/bitops.h:143 #1 0xc01e3909 in journal_invalidatepage (journal=0xc5f92200, page=0xc52385a0, offset=0) at fs/jbd/transaction.c:1906 #2 0xc0155b16 in do_invalidatepage (page=0xc5e98030, offset=3755564188) at mm/truncate.c:25 #3 0xc0155b60 in truncate_complete_page (mapping=0xc5e98030, page=0xc52385a0) at mm/truncate.c:52 #4 0xc0155d25 in truncate_inode_pages_range (mapping=0xf5ed7850, lstart=warning: Unhandled dwarf expresion opcode DW_OP_piece 3755564188, lend=warning: Unhandled dwarf expresion opcode DW_OP_piece -987527776) at mm/truncate.c:162 #5 0xc0155ee7 in truncate_inode_pages (mapping=0xc5e98030, lstart=warning: Unhandled dwarf expresion opcode DW_OP_piece 3755564188) at mm/truncate.c:219 #6 0xc015b405 in vmtruncate (inode=0xf5ed7780, offset=warning: Unhandled dwarf expresion opcode DW_OP_piece 444096111) at mm/memory.c:1587 #7 0xc0186da8 in inode_setattr (inode=0xf5ed7780, attr=0xee70cf4c) at fs/attr.c:74 #8 0xc01d5cf9 in ext3_setattr (dentry=0xc5e98030, attr=0xee70cf4c) at fs/ext3/inode.c:2807 #9 0xc0187038 in notify_change (dentry=0xf6e88414, attr=0xee70cf4c) at fs/attr.c:160 #10 0xc0168bd3 in do_truncate (dentry=0xf6e88414, length=warning: Unhandled dwarf expresion opcode DW_OP_piece 3755564188) at fs/open.c:210 #11 0xc016920b in sys_ftruncate64 (fd=3320414256, length=warning: Unhandled dwarf expresion opcode DW_OP_piece 444096111) at fs/open.c:317 #12 0xc01040bf in sysenter_past_esp () at include/asm/current.h:9 #13 0x00000005 in ?? () #14 0x1a785e6f in ?? () #15 0x00000000 in ?? () ----------------------------- (gdb) list journal_unmap_buffer 1734 * 1735 * We're outside-transaction here. Either or both of j_running_transaction 1736 * and j_committing_transaction may be NULL. 1737 */ 1738 static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) 1739 { 1740 transaction_t *transaction; 1741 struct journal_head *jh; 1742 int may_free = 1; 1743 int ret; ----------------------------- The first oops reported looks a lot like bug #4987. Both appear to be trapping on a store of a 16-bit value. Both show the current instruction without the 66 prefix, making it look like a 32-bit store. (Does the processor increment the IP past the prefix?) I don't why we're seeing an oops. In both cases the register being dereferenced has already been dereferenced. In neither case should the structure be freed by another thread. Both bug reports are missing the beginning part of the oops text. The second oops is in ext3, and I haven't looked too closely at it. I am still not able to replicate the bug in 2.6.13-rc4 kernel, will update you on encountering the problem Sooo... where do we stand on this bug now? This is still under investigation. The problem doesn't appear in 2.6.13-rc4, but a similar oops happens in 2.6.13-rc4-mm1 (bug #4987). So, I took a quick look at this, compiled 2.6.13-rc3-mm2 on a P4 Xeon box -- though I don't know exactly what options were in the config so my version could be slightly different from the version Sharyathi compiled. I haven't reproduced the problem myself so hopefully this isn't all totally bogus analysis ;-) It looks like we could be oopsing in this piece of code: /* initialize type dependent area for linelock */ linelock = (struct linelock *) & tlck->lock; linelock->next = 0; linelock->flag = tlckLINELOCK; linelock->maxcnt = TLOCKSHORT; linelock->index = 0; from jfs_txnmgr.c: lines 765-770 Here's output of objdump -ld on jfs_txnmgr.o looking at roughly the area where the oops occurs: fs/jfs/jfs_txnmgr.c:747 1180: 8b 5c 24 10 mov 0x10(%esp),%ebx 1184: 66 89 73 1a mov %si,0x1a(%ebx) fs/jfs/jfs_txnmgr.c:766 1188: 8d 55 10 lea 0x10(%ebp),%edx fs/jfs/jfs_txnmgr.c:767 118b: 66 c7 45 10 00 00 movw $0x0,0x10(%ebp) fs/jfs/jfs_txnmgr.c:768 1191: 66 c7 42 04 00 20 movw $0x2000,0x4(%edx) fs/jfs/jfs_txnmgr.c:769 1197: c6 42 02 14 movb $0x14,0x2(%edx) fs/jfs/jfs_txnmgr.c:770 119b: c6 42 03 00 movb $0x0,0x3(%edx) To me it looked like the oopsing instruction was at 0x1191 where we first dereference %edx. I'm confused why the instruction before that dereferences %ebp (instead of %edx) for line 767. Basically, this would mean that the linelock pointer is either somehow invalid, which seems to be unlikely, or possibly incorrectly aligned? You're playing some interesting games with the whole overlay thing, maybe a union would be more appropriate there? I'll try and reproduce on a my xeon machine... I think the code looks closer to lines 757-761: 757 jfs_ip->atltail = lid; 758 TXN_LOCK(); 759 list_add_tail(&jfs_ip->anon_inode_list, 760 &TxAnchor.anon_list); 761 TXN_UNLOCK(); fs/jfs/jfs_txnmgr.c:757 12eb: 8b 45 a0 mov 0xffffffa0(%ebp),%eax 12ee: 66 89 43 50 mov %ax,0x50(%ebx) fs/jfs/jfs_txnmgr.c:758 12f2: b8 1c 00 00 00 mov $0x1c,%eax 12f7: e8 fc ff ff ff call 12f8 <txLock+0x5f8> include/linux/list.h:54 12fc: c7 43 5c 6c 00 00 00 movl $0x6c,0x5c(%ebx) include/linux/list.h:52 1303: 8b 15 70 00 00 00 mov 0x70,%edx include/linux/list.h:81 1309: 89 d8 mov %ebx,%eax 130b: 83 c0 5c add $0x5c,%eax include/linux/list.h:53 130e: a3 70 00 00 00 mov %eax,0x70 include/linux/list.h:56 1313: 89 02 mov %eax,(%edx) include/linux/list.h:55 1315: 89 50 04 mov %edx,0x4(%eax) fs/jfs/jfs_txnmgr.c:761 1318: b8 1c 00 00 00 mov $0x1c,%eax 131d: e8 fc ff ff ff call 131e <txLock+0x61e> 1322: e9 d2 fe ff ff jmp 11f9 <txLock+0x4f9> It's hard to tell for sure, since it compiled differently for each of us. The actual instruction it trapped on is: 89 72 50 mov %esi,0x50(%edx) but I believe the real instruction is 66 89 72 50 mov %si,0x50(%edx) This would be consistent with what I observed in bug #4987, although I have no idea how we would have jumped into the middle of an instruction. Okay, I completely misunderstood the end of the oops. I thought the Code dump started at EIP, but it starts 43 bytes before EIP. ksymoops figures it out, but since the oops output was mangled, I didn't catch it. I finally figured out what was happening and this is the same as bug #4987. A patch can be found there. *** This bug has been marked as a duplicate of 4987 *** |