Bug 4721
Summary: | On running LTP suite, kernel came out with OOPs while running ftest03 test | ||
---|---|---|---|
Product: | File System | Reporter: | Sharyathi (sharyathi) |
Component: | VFS | Assignee: | other_other |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | akpm, bnpoorni, maneesh, pbadari, sglass, suparna, suzuki |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.12-rc5-mm2 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
GDb analysis of the dump taken from kdump
Configuration file for the primary kernel
iov access fix |
Description
Sharyathi
2005-06-08 03:37:01 UTC
This bug is also seen in 2.6.12-rc6-mm1. I encountered it on a machine with these specifications:
1-way Pentium IV 2.8GHz
2G RAM
Network interface: e1000
Disk I/O: SCSI storage controller: Adaptec Ultra320
Kernel: 2.6.12-rc6-mm1

Since I had enabled kdump, I was able to take a dump, and I have done a preliminary analysis of it, which I will attach. This is the oops message I got:
--------------------------------------
x206h login: Oops: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 1
EIP: 0060:[<c014982d>] Not tainted VLI
EFLAGS: 00010246 (2.6.12-rc6-mm1-I)
EIP is at generic_file_buffered_write+0x60d/0x6f0
eax: 00000080 ebx: 00000000 ecx: 00000000 edx: f2815000
esi: 00000800 edi: c4e01920 ebp: f3d8e940 esp: f4a4dce8
ds: 007b es: 007b ss: 0068
Process ftest08 (pid: 17500, threadinfo=f4a4c000 task=f7213020)
Stack: f36b5680 c4e01920 00000000 00000800 f2815000 00000000 f2814f80 000000bd
f3d8e950 00000800 00000800 00000000 08056170 00000800 f3d8e888 c0545760
f36b5680 000bd800 00000000 00000000 00000000 f2815000 2512557d 00000000
Call Trace:
[<c01865d2>] inode_update_time+0x52/0xe0
[<c0149c56>] __generic_file_aio_write_nolock+0x346/0x5d0
[<c014c111>] buffered_rmqueue+0xd1/0x250
[<c014a043>] __generic_file_write_nolock+0xa3/0xd0
[<c04c911d>] _spin_unlock+0xd/0x30
[<c0150d0b>] cache_grow+0x15b/0x1b0
[<c0136d30>] autoremove_wake_function+0x0/0x60
[<c04c911d>] _spin_unlock+0xd/0x30
[<c0150eab>] cache_alloc_refill+0x14b/0x250
[<c014a43a>] generic_file_writev+0x4a/0xe0
[<c02eb4f6>] copy_from_user+0x46/0x80
[<c016956b>] do_readv_writev+0x30b/0x310
[<c04c6bd3>] __down_failed+0x7/0xc
[<c0169bf6>] .text.lock.read_write+0x8/0x22
[<c0168510>] generic_file_llseek+0x0/0xe0
[<c0169628>] vfs_writev+0x58/0x60
[<c016972b>] sys_writev+0x4b/0xb0
[<c01031eb>] sysenter_past_esp+0x54/0x75
Code: 04 89 d0 29 d8 39 c8 0f 47 c1 01 c3 29 c1 39 da 75 e4 83 44 24 10 08 31 db 85 c9 75 dd 89 5c 24 5

Created attachment 5179 [details]
GDb analysis of the dump taken from kdump
This is the preliminary assessment of the dump that I took through kdump. Since
I had enabled panic_on_oops, the kernel booted into the secondary kernel on the oops.
Please let me know if further analysis of the dump needs to be done.

putting the right component

I am not able to reproduce the problem on 2.6.12-rc6-mm1 by just running ftest03. Is there a special config or setup needed? Does this happen only under stress?

elm3b29:~/ltp-full-20050608/testcases/kernel/fs/ftest # ./ftest03
ftest03 1 PASS : Test passed.
elm3b29:~/ltp-full-20050608/testcases/kernel/fs/ftest # uname -a
Linux elm3b29 2.6.12-rc6-mm1 #5 SMP Wed Jun 15 09:30:18 PDT 2005 x86_64 x86_64 x86_64 GNU/Linux

Anyway, I need more analysis of the dump, or please provide me the dump. It looks like the panic is while doing
buf = cur_iov->iov_base + iov_base; (line 1980)
Assuming that this is correct, it has to be due to the deref of cur_iov. Can you tell me what cur_iov is? Also, I was expecting to see an "invalid dereference" panic message rather than "Oops: 0000 [#1]". So can you look at the assembly code at generic_file_buffered_write+0x60d and the registers, and tell me if it really is due to dereferencing an invalid address? If you need help with this, please provide a pointer to the dump and vmlinux.

I was able to replicate the bug by individually running the ftest08 test case from the LTP test suite. I will attach the config file of the primary kernel. Suzuki will analyze the dump and update the bug. I will send you a mail regarding access to the dumps.

Created attachment 5180 [details]
Configuration file for the primary kernel
I am attaching the config file for the primary kernel; please have a look.

Badari, I could extract some more lines of information from the *log_buf* that were missing from the Description:

<1>Unable to handle kernel paging request at virtual address f2815000\n
<1> printing eip:\n
<4>c014982d\n
<1>*pde = 006a6067\n
<1>*pte = 32815000\n
<1>Oops: 0000 [#1]\n
<4>PREEMPT SMP DEBUG_PAGEALLOC\n

Disassembled output shows:
0xc0149823 <generic_file_buffered_write+1539>: mov 0x10(%esp),%edx
0xc0149827 <generic_file_buffered_write+1543>: mov %ebx,%ecx
0xc0149829 <generic_file_buffered_write+1545>: mov %edx,0x54(%esp)
0xc014982d <generic_file_buffered_write+1549>: mov (%edx),%ebx <---
^^^^^^^^^^ Problem hit here.
0xc014982f <generic_file_buffered_write+1551>: add %ebx,%ecx
0xc0149831 <generic_file_buffered_write+1553>: mov %ecx,0x30(%esp)
0xc0149835 <generic_file_buffered_write+1557>: jmp 0xc01494cc <generic_file_buffered_write+684>
0xc014983a <generic_file_buffered_write+1562>: ud2a

(gdb) info registers
eax 0x80 128
ecx 0x0 0
edx 0xf2815000 -226406400 <------
ebx 0x0 0
esp 0xf4a4dce8 0xf4a4dce8
ebp 0xf3d8e940 0xf3d8e940
esi 0x800 2048
edi 0xc4e01920 -991946464
eip 0xc014982d 0xc014982d
eflags 0x10246 66118
cs 0x60 96
ss 0x68 104
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51

I think there was an array overflow. The number of elements in the iov argument is 16, and the code was trying to access the 17th iovec.

[Switching to thread 2 (process 17500)]#0 generic_file_buffered_write (iocb=0xf4a4de4c, iov=0xf2814f80, nr_segs=16, pos=Unhandled dwarf expression opcode 0x93
^^^^^^^^^^
(gdb) p &((struct iovec*)0xf2814f80)[16]
$6 = (struct iovec *) 0xf281500

Also, 0xf2815000 seems to be the first byte of a page.

Thank you, Suzuki & Sharyathi, for the help. Looking at the extra "log_buf" messages and the disassembly, it is really clear that we are going beyond the array to access the next iov_base. Before I make a final patch, is it possible to hack the code and see if it works?
Please change (mm/filemap.c line:1980) to
if (count)
        buf = cur_iov->iov_base + iov_base;
and see if it fixes the problem. Please let me know.

Badari, it seems to have fixed the problem. I made the code modification, rebuilt the kernel, and reran the LTP test suite and the ftest08 test case that used to cause the oops. I did not encounter the problem that I had earlier.

Created attachment 5185 [details]
iov access fix
Fix to make sure we access iov only when there is some IO left to be done.
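
For readers who want to check the reasoning in this thread, here is a minimal user-space sketch. It is not the kernel code. The helpers iovec_addr and past_end_reads are hypothetical names introduced only for illustration, the 8-byte iovec size assumes the i386 layout (4-byte pointer plus 4-byte length), and the loop is a simplified model of the "advance to the next segment, then recompute buf" step discussed above. Segments are tracked by index so that nothing is actually read out of bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: i386 struct iovec is a 4-byte pointer plus a 4-byte length. */
#define IOVEC_SIZE_I386 8u
#define PAGE_SIZE_I386  4096u

/* Address of iov[index] on a 32-bit kernel, computed without dereferencing. */
static uint32_t iovec_addr(uint32_t iov, uint32_t index)
{
    return iov + index * IOVEC_SIZE_I386;
}

/*
 * Hypothetical model of the write loop: consume `count` bytes from nr_segs
 * segments of seg_len bytes each, moving to the next segment once the
 * current one is used up. Returns how many times the loop would read
 * iov[nr_segs] or beyond, i.e. past the end of the array.
 *
 * guarded == 0 models:  buf = cur_iov->iov_base + iov_base;
 * guarded != 0 models:  if (count) buf = cur_iov->iov_base + iov_base;
 */
static unsigned past_end_reads(size_t nr_segs, size_t seg_len,
                               size_t count, int guarded)
{
    size_t cur = 0;   /* index of the current segment (stands in for cur_iov) */
    size_t base = 0;  /* offset inside that segment (stands in for iov_base) */
    unsigned bad = 0;

    assert(seg_len > 0); /* zero-length segments are outside this model */

    while (count) {
        size_t copied = seg_len - base;
        if (copied > count)
            copied = count;

        count -= copied;
        base += copied;
        if (base == seg_len) {   /* segment exhausted: step to the next one */
            cur++;
            base = 0;
        }

        /* Recomputing buf reads iov[cur]; count it if cur is out of range. */
        if ((!guarded || count) && cur >= nr_segs)
            bad++;
    }
    return bad;
}
```

With the values from the dump, iovec_addr(0xf2814f80, 16) gives 0xf2815000, which matches edx and the faulting address and is a multiple of 4096. For 16 full segments (the 2048-byte segment length here is an assumption for illustration), past_end_reads(16, 2048, 16 * 2048, 0) reports one past-the-end read on the final iteration, while the guarded variant reports none.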