Distribution: SLES 9 SP 1 ----------------------------- Hardware Environment: 2way, Intel(R) XEON(TM) CPU 2.00GHz Network Interface Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet(Tigon3) Disk I/O: SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07) ----------------------------- Software Environment: Linux x335a 2.6.12-rc5-mm2-I #1 SMP Mon Jun 6 20:31:59 IST 2005 i686 i686 i386 GNU/Linux ----------------------------- Problem Description: While running LTP test suit, system came out with the following OOPs message. The OOps has occured while executing ftest03 test, whic is used to test file I/O with readv and writev system calls. Since kdump has problem capturing the dump (bug: 4720) i couldn't take the kernel dump. [ 4241.176305] Oops: 0000 [#1] [ 4241.185481] PREEMPT SMP DEBUG_PAGEALLOC [ 4241.198150] Modules linked in: [ 4241.208202] CPU: 1 [ 4241.208203] EIP: 0060:[<c014a6c9>] Not tainted VLI [ 4241.208205] EFLAGS: 00010246 (2.6.12-rc5-mm2-I) [ 4241.248845] EIP is at generic_file_buffered_write+0x5d9/0x650 [ 4241.267715] eax: 00000080 ebx: 00000000 ecx: 00000000 edx: ca02f000 [ 4241.290020] esi: 00000800 edi: 00000800 ebp: cc5a1d70 esp: cc5a1cc8 [ 4241.312324] ds: 007b es: 007b ss: 0068 [ 4241.325778] Process ftest03 (pid: 28251, threadinfo=cc5a0000 task=f33b0030) [ 4241.348080] Stack: f5d0bb80 c6cfa900 00000000 00000800 00000286 000000a2 ca02f000 ca02ef80 [ 4241.375760] 00000800 c6cfa900 00000000 08055948 00000800 f3b154f8 c0494420 f3b155b0 [ 4241.403440] f5d0bb80 000a2800 00000000 00000000 00000000 ca02f000 31a6a479 00000000 [ 4241.431120] Call Trace: [ 4241.439741] [<c010414b>] show_stack+0xab/0xc0 [ 4241.454386] [<c01042e4>] show_registers+0x164/0x1d0 [ 4241.470744] [<c010450d>] die+0x10d/0x1b0 [ 4241.483960] [<c0116ef7>] do_page_fault+0x4b7/0x6c8 [ 4241.500034] [<c0103d53>] error_code+0x4f/0x54 [ 4241.514681] [<c014aa56>] __generic_file_aio_write_nolock+0x316/0x590 [ 4241.535895] [<c014ae10>] __generic_file_write_nolock+0x90/0xb0 [ 4241.555397] [<c014b194>] generic_file_writev+0x44/0xc0 [ 4241.572638] [<c0169b19>] do_readv_writev+0x2c9/0x2e0 [ 4241.589283] [<c0169be5>] vfs_writev+0x55/0x60 [ 4241.603927] [<c0169ce7>] sys_writev+0x47/0xb0 [ 4241.618572] [<c01031bf>] sysenter_past_esp+0x54/0x75 [ 4241.635217] Code: d0 29 d8 39 c8 0f 47 c1 01 c3 29 c1 39 da 75 e2 83 85 70 ff ff ff 08 31 db 85 c9 7
This bug is noticed even in 2.6.12-rc6-mm1
I encoutered the same bug On machine with these specificaions 1way, Pentium IV 2.8GHz. 2G RAM Network Interface (e1000) Disk I/O: SCSI storage controller: Adaptec Ultra320 on kernel 2.6.12-rc6-mm1, Since I had enabled kdump I was able to take dump and I have done preliminary analysis of the dump. I will attach my preliminary analysis of the dump This is the oops message I got -------------------------------------- x206h login: Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 1 EIP: 0060:[<c014982d>] Not tainted VLI EFLAGS: 00010246 (2.6.12-rc6-mm1-I) EIP is at generic_file_buffered_write+0x60d/0x6f0 eax: 00000080 ebx: 00000000 ecx: 00000000 edx: f2815000 esi: 00000800 edi: c4e01920 ebp: f3d8e940 esp: f4a4dce8 ds: 007b es: 007b ss: 0068 Process ftest08 (pid: 17500, threadinfo=f4a4c000 task=f7213020) Stack: f36b5680 c4e01920 00000000 00000800 f2815000 00000000 f2814f80 000000bd f3d8e950 00000800 00000800 00000000 08056170 00000800 f3d8e888 c0545760 f36b5680 000bd800 00000000 00000000 00000000 f2815000 2512557d 00000000 Call Trace: [<c01865d2>] inode_update_time+0x52/0xe0 [<c0149c56>] __generic_file_aio_write_nolock+0x346/0x5d0 [<c014c111>] buffered_rmqueue+0xd1/0x250 [<c014a043>] __generic_file_write_nolock+0xa3/0xd0 [<c04c911d>] _spin_unlock+0xd/0x30 [<c0150d0b>] cache_grow+0x15b/0x1b0 [<c0136d30>] autoremove_wake_function+0x0/0x60 [<c04c911d>] _spin_unlock+0xd/0x30 [<c0150eab>] cache_alloc_refill+0x14b/0x250 [<c014a43a>] generic_file_writev+0x4a/0xe0 [<c02eb4f6>] copy_from_user+0x46/0x80 [<c016956b>] do_readv_writev+0x30b/0x310 [<c04c6bd3>] __down_failed+0x7/0xc [<c0169bf6>] .text.lock.read_write+0x8/0x22 [<c0168510>] generic_file_llseek+0x0/0xe0 [<c0169628>] vfs_writev+0x58/0x60 [<c016972b>] sys_writev+0x4b/0xb0 [<c01031eb>] sysenter_past_esp+0x54/0x75 Code: 04 89 d0 29 d8 39 c8 0f 47 c1 01 c3 29 c1 39 da 75 e4 83 44 24 10 08 31 db 85 c9 75 dd 89 5c 24 5
Created attachment 5179 [details] GDb analysis of the dump taken from kdump This is the preliminary assesment of the dump that I took through kdump. Since I had enabled panic_on_oops kernel booted into secondary kernel on oops. Please let me know if furter analysis of the dump needs to be done.
putting the right component
I am not able to reproduce the problem on 2.6.12-rc6-mm1 by just running ftest03. Is there special config or setup needed ? Does this happen only with stress ? elm3b29:~/ltp-full-20050608/testcases/kernel/fs/ftest # ./ftest03 ftest03 1 PASS : Test passed. elm3b29:~/ltp-full-20050608/testcases/kernel/fs/ftest # uname -a Linux elm3b29 2.6.12-rc6-mm1 #5 SMP Wed Jun 15 09:30:18 PDT 2005 x86_64 x86_64 x86_64 GNU/Linux Anyway, need more analysis on the dump or provide me the dump. It looks like the panic is while doing .. buf = cur_iov->iov_base + iov_base; (line 1980) (assuming that this is correct) it has to be due to deref of cur_iov. Can you tell me what is cur_iov ? And also, I was expecting to see "invalid dereference" panic message rather than "Oops: 0000 [#1]". So can you look at the assembly code at generic_file_buffered_write+0x60d and the registers and tell me if it really is due to derefing a invalid address ? If you need help on this, please provide pointer to dump and vmlinux.
I was able to replicate the bug by individually running the ftest08 test case from the ltp test suite. I will attach the config file of the primary kernel. Suzuki will analyze the dump and update the bug. I will send you a mail regarding access to dumps.
Created attachment 5180 [details] Configuration file for the primary kernel I am attaching the config file for the primary kernel, please have a look
Badari, I could extract some more lines of information from the *log_buf*, which were missing in the Description. <1>Unable to handle kernel paging request at virtual address f2815000\n <1> printing eip:\n<4>c014982d\n <1>*pde = 006a6067\n <1>*pte = 32815000\n <1>Oops: 0000 [#1]\n <4>PREEMPT SMP DEBUG_PAGEALLOC\n Disassembled output shows: 0xc0149823 <generic_file_buffered_write+1539>: mov 0x10(%esp),%edx 0xc0149827 <generic_file_buffered_write+1543>: mov %ebx,%ecx 0xc0149829 <generic_file_buffered_write+1545>: mov %edx,0x54(%esp) 0xc014982d <generic_file_buffered_write+1549>: mov (%edx),%ebx <--- ^^^^^^^^^^ Problem hit here. 0xc014982f <generic_file_buffered_write+1551>: add %ebx,%ecx 0xc0149831 <generic_file_buffered_write+1553>: mov %ecx,0x30(%esp) 0xc0149835 <generic_file_buffered_write+1557>: jmp 0xc01494cc <generic_file_buffered_write+684> 0xc014983a <generic_file_buffered_write+1562>: ud2a (gdb) info registers eax 0x80 128 ecx 0x0 0 edx 0xf2815000 -226406400 <------ ebx 0x0 0 esp 0xf4a4dce8 0xf4a4dce8 ebp 0xf3d8e940 0xf3d8e940 esi 0x800 2048 edi 0xc4e01920 -991946464 eip 0xc014982d 0xc014982d eflags 0x10246 66118 cs 0x60 96 ss 0x68 104 ds 0x7b 123 es 0x7b 123 fs 0x0 0 gs 0x33 51
I think that there was some array overflow. The no. of elements in iov argument is 16, and the code was trying to access 17th iovec. [Switching to thread 2 (process 17500)]#0 generic_file_buffered_write (iocb=0xf4a4de4c, iov=0xf2814f80, nr_segs=16, pos=Unhandled dwarf expression opcode 0x93 ^^^^^^^^^^ (gdb) p &((struct iovec*)0xf2814f80)[16] $6 = (struct iovec *) 0xf281500 Also 0xf2815000 seems to be the first byte of a page.
Thank you Suziki & Sharyathi for the help. Looking at extra "log_buf" messages, disassembly - it really clear that we are going beyond array to access the next iov_base. Before I make a final patch, is it possible to hack the code and see if it works ? Please change (mm/filemap.c line:1980) to if (count) buf = cur_iov->iov_base + iov_base; and see if it fixes the problem. Please let me know.
Bhadri It seems to have fixed the problem I did the code modification and rebuilt the kernel and reran the LTP testsuite and ftest08 test case which used to cause the oops. But I didn't encounter the problem that I had earlier.
Created attachment 5185 [details] iov access fix Fix to make sure we access iov only there is some IO to be done.