Bug 4987
Summary: | Kernel OOPs while running Files System Stress on JFS | ||
---|---|---|---|
Product: | File System | Reporter: | Sharyathi (sharyathi) |
Component: | JFS | Assignee: | Dave Kleikamp (shaggy) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | bnpoorni, maneesh, sglass |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.13-rc4-mm1 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
Analysis of dump taken through the kdump
Log buff out put from the dump Fix race in txLock() |
Description
Sharyathi
2005-08-02 23:54:32 UTC
Created attachment 5487 [details]
Analysis of dump taken through the kdump
I am attaching the preliminary analysis of the dump taken through kdump. Please
let us know if you need more information from the dump
*** Bug 4986 has been marked as a duplicate of this bug. *** It would be helpful to have the top of the oops message. There is usually useful information before the line, "PREEMPT SMP ...". I'm not sure how far I can trust the oops output as to where it trapped. The code disassembles to: 00000000 <_EIP>: Code; c024fbfc No symbols available <===== 0: 89 41 1a mov %eax,0x1a(%ecx) <===== Code; c024fbff No symbols available 3: 66 c7 43 50 00 00 movw $0x0,0x50(%ebx) Code; c024fc05 No symbols available 9: b8 78 6b 51 c0 mov $0xc0516b78,%eax Code; c024fc0a No symbols available e: 66 c7 43 4e 00 00 movw $0x0,0x4e(%ebx) Code; c024fc10 No symbols available 14: e8 c6 8a 24 00 call 248adf <_EIP+0x248adf> But, if I compile jfs_txnmgr.c myself, the first instruction should have a '66' prefix, making it: 0: 66 89 41 1a mov %ax,0x1a(%ecx) <===== Does this mean that the processor jumped into the middle of the instruction, or is the oops processing in the kernel not quite accurate? It would be helpful to see the full oops message, though. Dave I hit upon this bug again, But my console is not able to show me any more information, above PREEMPT SMP, apart from Oops: 0002[#] Here again I am attaching the complete Oops message. ------------------------------------ Oops: 0002 [#] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: /kernel/crash_notes Modules linked in: CPU: 1 EIP: 0060:[<c024fbfc>] Not tainted VLI EFLAGS: 00010202 (2.6.13-rc4-mm1-I) EIP is at txCommit+0x16c/0x300 eax: dc862108 ebx: dc8620ac ecx: f395c4ac edx: c05e8a6c esi: dc8622d8 edi: 0000a027 ebp: e5053b70 esp: e5053b00 ds: 007b es: 007b ss: 0068 Process dd (pid: 11719, threadinfo=e5052000 task=f54ff530) Stack: d5238e00 e5053b14 c04986c9 e5053b14 00000001 00000000 f882bfe8 f37fda80 0000030a 0000030a 00000000 f37fda80 f7e82200 00000001 e5053b98 00241037 00000000 00000000 00000000 c5ab5ca0 00000000 00000000 00000000 00000000 Call Trace: [<c0103fcb>] show_stack+0xab/0xc0 [<c0104164>] show_registers+0x164/0x1d0 [<c01043a2>] die+0x122/0x1c0 [<c04996dc>] do_page_fault+0x3dc/0x61d [<c0103c53>] error_code+0x4f/0x54 [<c023044b>] jfs_commit_inode+0x10b/0x110 [<c02304a4>] jfs_write_inode+0x54/0x70 [<c018be94>] write_inode+0x44/0x50 [<c018bfbf>] __sync_single_inode+0x11f/0x230 [<c018c14f>] __writeback_single_inode+0x7f/0x150 [<c018c408>] generic_sync_sb_inodes+0x1e8/0x300 [<c018c54d>] sync_sb_inodes+0x2d/0x40 [<c018c625>] writeback_inodes+0xc5/0xd0 [<c014b738>] balance_dirty_pages+0x48/0x150 [<c014b8bd>] balance_dirty_pages_ratelimited+0x7d/0xb0 [<c0147204>] generic_file_buffered_write+0x2b4/0x680 [<c01478ea>] __generic_file_aio_write_nolock+0x31a/0x580 [<c0147c90>] __generic_file_write_nolock+0x90/0xb0 [<c0147ea2>] generic_file_write+0x52/0xd0 [<c0166258>] vfs_write+0xb8/0x190 [<c01663fb>] sys_write+0x4b/0x80 [<c01030ff>] sysenter_past_esp+0x54/0x75 Code: 89 41 1a 66 c7 43 50 00 00 b8 78 6b 51 c0 66 c7 43 4e 00 00 e8 c6 8a 24 00 8d 86 30 fe ff ff 8b 9 Linux version 2.6.13-rc4-mm1-II (root@x235b) (gcc version 3.3.3 (SuSE Linux)) #1 PREEMPT Tue Aug 2 08:5 BIOS-provided physical RAM map: ## Secondary Kernel booting ------------------------------------ I have memory dump with me let me know if you nees any information from dump. Sharyathi, If you have the dump you could get the entire log_buf and could probably see the missing line in oops message. Use "x/100s log_buf" to dump the log_buf from dump in gdb. Created attachment 5501 [details]
Log buff out put from the dump
This is the out put of log buffer from the dump I had taken.
I was able to reproduce the problem. movzwl 78(%esi), %eax cmpw $0, 26(%edx) movw %ax, 24(%edx) jne .L474 movzwl 80(%esi), %eax movw %ax, 26(%edx) The instruction pointer is 1 byte into the above instruction, just past the 66 prefix that makes this a 16-bit store. The contents of eax are c16fe7f0, which should not be possible, cine the previous instruction should have zeroed the high order 16 bits. The only explanation, is that the processor took a bad branch or return into the wrong place. .L474: movw $0, 80(%esi) movl $jfsTxnLock, %eax movw $0, 78(%esi) call _spin_lock leal -472(%ebx), %eax movl -472(%ebx), %edx I recreated the oops using kgdb, and this time it shows me I'm at line 1246 of jfs_txnmgr.c: list_del_init(&jfs_ip->anon_inode_list); jfs_ip->anon-inode_list.prev looks like a valid address, but it can't be accessed, so it looks like an inode was freed without being taken off the anon_inode_list. I should be able to figure it out from here. Created attachment 5583 [details]
Fix race in txLock()
I believe this patch fixes the problem. It took a while to hit the oops, so I'm going to keep running for a while to make sure. *** Bug 4963 has been marked as a duplicate of this bug. *** |