Bug 4721 - On running LTP suite, kernel came out with OOPs while running ftest03 test
Summary: On running LTP suite, kernel came out with OOPs while running ftest03 test
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: VFS (show other bugs)
Hardware: i386 Linux
: P2 high
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-06-08 03:37 UTC by Sharyathi
Modified: 2005-06-17 15:35 UTC (History)
7 users (show)

See Also:
Kernel Version: 2.6.12-rc5-mm2
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
GDb analysis of the dump taken from kdump (4.07 KB, text/plain)
2005-06-15 03:58 UTC, Sharyathi
Details
Configuration file for the primary kernel (29.20 KB, text/plain)
2005-06-16 02:15 UTC, Sharyathi
Details
iov access fix (428 bytes, patch)
2005-06-17 15:35 UTC, Badari Pulavarty
Details | Diff

Description Sharyathi 2005-06-08 03:37:01 UTC
Distribution:
 SLES 9 SP 1
-----------------------------
Hardware Environment:
 2way, Intel(R) XEON(TM) CPU 2.00GHz
 Network Interface Broadcom Corporation NetXtreme BCM5703X Gigabit    
 Ethernet(Tigon3)
 Disk I/O: SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X 
 Fusion-MPT Dual Ultra320 SCSI (rev 07)
-----------------------------
 Software Environment:
 Linux x335a 2.6.12-rc5-mm2-I #1 SMP Mon Jun 6 20:31:59 IST 2005 i686 i686 i386
 GNU/Linux
-----------------------------
 Problem Description:
 While running LTP test suit, system came out with the following OOPs message.
 The OOps has occured while executing ftest03 test, whic is used to test file   
 I/O  with readv and writev system calls.
 Since kdump has problem capturing the dump (bug: 4720) i couldn't take the 
 kernel dump.

 [ 4241.176305] Oops: 0000 [#1]
 [ 4241.185481] PREEMPT SMP DEBUG_PAGEALLOC
 [ 4241.198150] Modules linked in:
 [ 4241.208202] CPU:    1
 [ 4241.208203] EIP:    0060:[<c014a6c9>]    Not tainted VLI
 [ 4241.208205] EFLAGS: 00010246   (2.6.12-rc5-mm2-I)
 [ 4241.248845] EIP is at generic_file_buffered_write+0x5d9/0x650
 [ 4241.267715] eax: 00000080   ebx: 00000000   ecx: 00000000   edx: ca02f000
 [ 4241.290020] esi: 00000800   edi: 00000800   ebp: cc5a1d70   esp: cc5a1cc8
 [ 4241.312324] ds: 007b   es: 007b   ss: 0068
 [ 4241.325778] Process ftest03 (pid: 28251, threadinfo=cc5a0000 task=f33b0030)
 [ 4241.348080] Stack: f5d0bb80 c6cfa900 00000000 00000800 00000286 000000a2    
ca02f000 ca02ef80
 [ 4241.375760]        00000800 c6cfa900 00000000 08055948 00000800 f3b154f8 
c0494420 f3b155b0
 [ 4241.403440]        f5d0bb80 000a2800 00000000 00000000 00000000 ca02f000 
31a6a479 00000000
 [ 4241.431120] Call Trace:
 [ 4241.439741]  [<c010414b>] show_stack+0xab/0xc0
 [ 4241.454386]  [<c01042e4>] show_registers+0x164/0x1d0
 [ 4241.470744]  [<c010450d>] die+0x10d/0x1b0
 [ 4241.483960]  [<c0116ef7>] do_page_fault+0x4b7/0x6c8
 [ 4241.500034]  [<c0103d53>] error_code+0x4f/0x54
 [ 4241.514681]  [<c014aa56>] __generic_file_aio_write_nolock+0x316/0x590
 [ 4241.535895]  [<c014ae10>] __generic_file_write_nolock+0x90/0xb0
 [ 4241.555397]  [<c014b194>] generic_file_writev+0x44/0xc0
 [ 4241.572638]  [<c0169b19>] do_readv_writev+0x2c9/0x2e0
 [ 4241.589283]  [<c0169be5>] vfs_writev+0x55/0x60
 [ 4241.603927]  [<c0169ce7>] sys_writev+0x47/0xb0
 [ 4241.618572]  [<c01031bf>] sysenter_past_esp+0x54/0x75
 [ 4241.635217] Code: d0 29 d8 39 c8 0f 47 c1 01 c3 29 c1 39 da 75 e2 83 85 70 
ff ff ff 08 31 db 85 c9 7
Comment 1 Sharyathi 2005-06-10 02:53:05 UTC
This bug is noticed even in 2.6.12-rc6-mm1
Comment 2 Sharyathi 2005-06-15 03:27:12 UTC
I encoutered the same bug 
On machine with these specificaions
   1way, Pentium IV 2.8GHz. 2G RAM
   Network Interface (e1000)
   Disk I/O: SCSI storage controller: Adaptec Ultra320
on kernel 2.6.12-rc6-mm1, 

Since I had enabled kdump I was able to take dump and I have done preliminary
analysis of the dump. I will attach my preliminary analysis of the dump
This is the oops message I got
--------------------------------------
x206h login: Oops: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU:    1
EIP:    0060:[<c014982d>]    Not tainted VLI
EFLAGS: 00010246   (2.6.12-rc6-mm1-I)
EIP is at generic_file_buffered_write+0x60d/0x6f0
eax: 00000080   ebx: 00000000   ecx: 00000000   edx: f2815000
esi: 00000800   edi: c4e01920   ebp: f3d8e940   esp: f4a4dce8
ds: 007b   es: 007b   ss: 0068
Process ftest08 (pid: 17500, threadinfo=f4a4c000 task=f7213020)
Stack: f36b5680 c4e01920 00000000 00000800 f2815000 00000000 f2814f80 000000bd
       f3d8e950 00000800 00000800 00000000 08056170 00000800 f3d8e888 c0545760
       f36b5680 000bd800 00000000 00000000 00000000 f2815000 2512557d 00000000
Call Trace:
 [<c01865d2>] inode_update_time+0x52/0xe0
 [<c0149c56>] __generic_file_aio_write_nolock+0x346/0x5d0
 [<c014c111>] buffered_rmqueue+0xd1/0x250
 [<c014a043>] __generic_file_write_nolock+0xa3/0xd0
 [<c04c911d>] _spin_unlock+0xd/0x30
 [<c0150d0b>] cache_grow+0x15b/0x1b0
 [<c0136d30>] autoremove_wake_function+0x0/0x60
 [<c04c911d>] _spin_unlock+0xd/0x30
 [<c0150eab>] cache_alloc_refill+0x14b/0x250
 [<c014a43a>] generic_file_writev+0x4a/0xe0
 [<c02eb4f6>] copy_from_user+0x46/0x80
 [<c016956b>] do_readv_writev+0x30b/0x310
 [<c04c6bd3>] __down_failed+0x7/0xc
 [<c0169bf6>] .text.lock.read_write+0x8/0x22
 [<c0168510>] generic_file_llseek+0x0/0xe0
 [<c0169628>] vfs_writev+0x58/0x60
 [<c016972b>] sys_writev+0x4b/0xb0
 [<c01031eb>] sysenter_past_esp+0x54/0x75
Code: 04 89 d0 29 d8 39 c8 0f 47 c1 01 c3 29 c1 39 da 75 e4 83 44 24 10 08 31 db
85 c9 75 dd 89 5c 24 5
Comment 3 Sharyathi 2005-06-15 03:58:31 UTC
Created attachment 5179 [details]
GDb analysis of the dump taken from kdump

This is the preliminary assesment of the dump that I took through kdump. Since
I had enabled panic_on_oops kernel booted into secondary kernel on oops. Please
let me know if furter analysis of the dump needs to be done.
Comment 4 Maneesh Soni 2005-06-15 04:59:08 UTC
putting the right component
Comment 5 Badari Pulavarty 2005-06-15 17:46:40 UTC
I am not able to reproduce the problem on 2.6.12-rc6-mm1 by
just running ftest03. Is there special config or setup needed ?
Does this happen only with stress ?

elm3b29:~/ltp-full-20050608/testcases/kernel/fs/ftest # ./ftest03
ftest03     1  PASS  :  Test passed.
 
elm3b29:~/ltp-full-20050608/testcases/kernel/fs/ftest # uname -a
Linux elm3b29 2.6.12-rc6-mm1 #5 SMP Wed Jun 15 09:30:18 PDT 2005 x86_64 x86_64
x86_64 GNU/Linux


Anyway, need more analysis on the dump or provide me the dump.
It looks like the panic is while doing ..

buf = cur_iov->iov_base + iov_base; (line 1980)

(assuming that this is correct) it has to be due to deref of cur_iov.
Can you tell me what is cur_iov ? And also, I was expecting to see
"invalid dereference" panic message rather than  "Oops: 0000 [#1]".
So can you look at the assembly code at generic_file_buffered_write+0x60d
and the registers and tell me if it really is due to derefing a invalid
address ?

If you need help on this, please provide pointer to dump and vmlinux.
Comment 6 Sharyathi 2005-06-16 02:13:31 UTC
I was able to replicate the bug by individually running the ftest08 test case 
from the ltp test suite.
I will attach the config file of the primary kernel.
Suzuki will analyze the dump and update the bug.
I will send you a mail regarding access to dumps.
Comment 7 Sharyathi 2005-06-16 02:15:35 UTC
Created attachment 5180 [details]
Configuration file for the primary kernel 

I am attaching the config file for the primary kernel, please have a look
Comment 8 Suzuki 2005-06-16 02:21:19 UTC
Badari,

I could extract some more lines of information from the *log_buf*, which were
missing in the Description.

<1>Unable to handle kernel paging request at virtual address f2815000\n
<1> printing eip:\n<4>c014982d\n
<1>*pde = 006a6067\n
<1>*pte = 32815000\n
<1>Oops: 0000 [#1]\n
<4>PREEMPT SMP DEBUG_PAGEALLOC\n

Disassembled output shows:

0xc0149823 <generic_file_buffered_write+1539>:  mov    0x10(%esp),%edx
0xc0149827 <generic_file_buffered_write+1543>:  mov    %ebx,%ecx
0xc0149829 <generic_file_buffered_write+1545>:  mov    %edx,0x54(%esp)

0xc014982d <generic_file_buffered_write+1549>:  mov    (%edx),%ebx <---
^^^^^^^^^^ Problem hit here. 
0xc014982f <generic_file_buffered_write+1551>:  add    %ebx,%ecx
0xc0149831 <generic_file_buffered_write+1553>:  mov    %ecx,0x30(%esp)
0xc0149835 <generic_file_buffered_write+1557>:  jmp    0xc01494cc
<generic_file_buffered_write+684>
0xc014983a <generic_file_buffered_write+1562>:  ud2a

(gdb) info registers
eax            0x80     128
ecx            0x0      0
edx            0xf2815000       -226406400 <------

ebx            0x0      0
esp            0xf4a4dce8       0xf4a4dce8
ebp            0xf3d8e940       0xf3d8e940
esi            0x800    2048
edi            0xc4e01920       -991946464
eip            0xc014982d       0xc014982d
eflags         0x10246  66118
cs             0x60     96
ss             0x68     104
ds             0x7b     123
es             0x7b     123
fs             0x0      0
gs             0x33     51


 
Comment 9 Suzuki 2005-06-16 05:08:46 UTC
I think that there was some array overflow.
 
 The no. of elements in iov argument is 16, and the code was trying to access
17th iovec.

[Switching to thread 2 (process 17500)]#0  generic_file_buffered_write
(iocb=0xf4a4de4c, iov=0xf2814f80,
 nr_segs=16, pos=Unhandled dwarf expression opcode 0x93
 ^^^^^^^^^^
   
(gdb) p &((struct iovec*)0xf2814f80)[16]
$6 = (struct iovec *) 0xf281500

Also 0xf2815000 seems to be the first byte of a page.
Comment 10 Badari Pulavarty 2005-06-16 08:02:00 UTC
Thank you Suziki & Sharyathi for the help.

Looking at extra "log_buf" messages, disassembly - it really clear that
we are going beyond array to access the next iov_base.

Before I make a final patch, is it possible to hack the code and see
if it works ?

Please change (mm/filemap.c line:1980) to
        if (count)
            buf = cur_iov->iov_base + iov_base;

and see if it fixes the problem.  Please let me know.
Comment 11 Sharyathi 2005-06-17 02:41:13 UTC
Bhadri 
It seems to have fixed the problem
I did the code modification and rebuilt the kernel and reran the LTP testsuite and  
ftest08 test case which used to cause the oops. But I didn't encounter the
problem that I had earlier.
Comment 12 Badari Pulavarty 2005-06-17 15:35:17 UTC
Created attachment 5185 [details]
iov access fix

Fix to make sure we access iov only there is some IO to be done.

Note You need to log in before you can comment on or make changes to this bug.