Most recent kernel where this bug did *NOT* occur: No idea Distribution: Redhat Enterprise Linux 4 (RHEL4) Hardware Environment: HP ProLiant DL380 G4 (IA32) Software Environment: Dedicated NFS file server (knfsd over 6.4T XFS file system) Problem Description: Kernel Oops here: Jan 10 10:17:51 ss2a kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000 Jan 10 10:17:51 ss2a kernel: printing eip: Jan 10 10:17:51 ss2a kernel: 00000000 Jan 10 10:17:51 ss2a kernel: *pde = dda19067 Jan 10 10:17:51 ss2a kernel: Oops: 0000 [#1] ... Jan 10 10:17:51 ss2a kernel: CPU: 0 Jan 10 10:17:51 ss2a kernel: EIP: 0060:[<00000000>] Not tainted VLI Jan 10 10:17:51 ss2a kernel: EFLAGS: 00010286 (2.6.9-42.0.2.EL.R7smp) Jan 10 10:17:51 ss2a kernel: EIP is at 0x0 Jan 10 10:17:51 ss2a kernel: eax: edfef1b4 ebx: c042f660 ecx: 00000000 edx: d42fe4f0 Jan 10 10:17:51 ss2a kernel: esi: d42fe4f0 edi: f1ef1ef8 ebp: 00000000 esp: f1ef1edc Jan 10 10:17:51 ss2a kernel: ds: 007b es: 007b ss: 0068 Jan 10 10:17:51 ss2a kernel: Process nfsd (pid: 3560, threadinfo=f1ef0000 task=f1ec8f10) Jan 10 10:17:51 ss2a kernel: Stack: c0160bbd edfef1b4 ffffffff f60ec07b 636378f7 f518ef58 c0160c35 636378f7 Jan 10 10:17:51 ss2a kernel: 00000013 f60ec068 c033b240 f518ef58 f518ef58 00000013 f60eb004 c0160c4b Jan 10 10:17:51 ss2a kernel: 00000000 f8c69e13 f60ec068 c31e2e00 f7dc1e40 c31e2e00 f8c1838a f60eb800 Jan 10 10:17:51 ss2a kernel: Call Trace: Jan 10 10:17:51 ss2a kernel: [<c0160bbd>] __lookup_hash+0x70/0x89 Jan 10 10:17:51 ss2a kernel: [<c0160c35>] lookup_one_len_it+0x58/0x67 Jan 10 10:17:51 ss2a kernel: [<c0160c4b>] lookup_one_len+0x7/0x9 Jan 10 10:17:51 ss2a kernel: [<f8c69e13>] nfsd_lookup+0x304/0x390 [nfsd] Jan 10 10:17:51 ss2a kernel: [<f8c1838a>] svcauth_unix_set_client+0xa7/0xb5 [sunrpc] Jan 10 10:17:51 ss2a kernel: [<f8c720ad>] nfsd3_proc_lookup+0xa9/0xb3 [nfsd] Jan 10 10:17:51 ss2a kernel: [<f8c740c8>] nfs3svc_decode_diropargs+0x0/0xfa [nfsd] Jan 10 10:17:51 ss2a kernel: [<f8c67685>] nfsd_dispatch+0xba/0x16d [nfsd] Jan 10 10:17:51 ss2a kernel: [<f8c15559>] svc_process+0x432/0x6e1 [sunrpc] Jan 10 10:17:51 ss2a kernel: [<f8c6745e>] nfsd+0x1cc/0x339 [nfsd] Jan 10 10:17:51 ss2a kernel: [<f8c67292>] nfsd+0x0/0x339 [nfsd] Jan 10 10:17:51 ss2a kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb Jan 10 10:17:51 ss2a kernel: Code: Bad EIP value. Jan 10 10:17:51 ss2a kernel: <0>Fatal exception: panic in 5 seconds ... Calls to symbols found in System.map suggests this is the correct 'area', my attempt at ASM disassembly 70 bytes into __lookup_hash() (0xc0160bbd) shows: 0xc0160ba2: call 0xc0168d13 ; call - 'd_alloc()' 0xc0160ba7: test %eax,%eax ; test for zero/null pointer 'if (!new)' 0xc0160ba9: mov %eax,%esi ; dentry = ERR_PTR(-ENOMEM); 0xc0160bab: je 0xc0160bce ; jump if equal to VA 0xc0160bce - 'goto out;' 0xc0160bad: mov (%esp),%eax ; parameter - 'nd,' 0xc0160bb0: mov %ebp,%ecx ; parameter - 'new,' 0xc0160bb2: mov %esi,%edx ; parameter - 'inode' 0xc0160bb4: mov 0x90(%eax),%ebx ; inode->i_op-> 0xc0160bba: call *0x4(%ebx) ; FS specific lookup - lookup(); 0xc0160bbd: test %eax,%eax ; test for zero/null pointer - 'if (!dentry)' ; ... 0xc0160bce: pop %esi ; begin to restore stack - '} out:' ; ... 0xc0160bd5: ret ; return to caller (exit function) 'return dentry;}' Which corresponds to this section of the __lookup_hash() function in /usr/src/linux/fs/namei.c: if (!dentry) { struct dentry *new = d_alloc(base, name); dentry = ERR_PTR(-ENOMEM); if (!new) goto out; dentry = inode->i_op->lookup(inode, new, nd); if (!dentry) dentry = new; else dput(new); } Steps to reproduce: Cannot reproduce easily in a production env. but we see this on at least one server out of 50 approximately once a week. We use large XFS file systems exported via knfsd and gets used heavily (large files, lots of I/O). I believe I have diagnosed correctly (if not I'd appreciate a pointer as to where the *real* problem is) and have provided this patch (below). But I cannot suggested why it is happening. Someone with more experience with this FS (VFS?) layer could perhaps suggest why this pointer is NULL and indeed the FS dependent call to lookup() via i_op is essentially corrupt. Indeed EIP does seem to want to execute 0 because the function pointer lookup() dereferenced by inode->i_op is never checked for NULL - neither, in fact, are the inode and i_op pointers. Patch: --- namei.c 2007-01-10 17:33:54.133034000 +0000 +++ namei.jvanns 2007-01-11 11:39:02.201418217 +0000 @@ -1280,7 +1280,15 @@ dentry = ERR_PTR(-ENOMEM); if (!new) goto out; - dentry = inode->i_op->lookup(inode, new, nd); + + if (!inode) + printk ("FSCFC: __lookup_hash(): NULL pointer 'inode' on line %d.\n", __LINE__); + else if (!inode->i_op) + printk ("FSCFC: __lookup_hash(): NULL pointer 'inode->i_op' (Dev #: %0#8x, Inode #: %lu) on line %d.\n", + inode->i_rdev, inode->i_ino, __LINE__); + else + dentry = inode->i_op->lookup(inode, new, nd); + if (!dentry) dentry = new; else Ignore printk() statements if needed - this is for our purposes. Regards, Jim Vanns
Created attachment 10056 [details] Patch to namei.c for testing NULL pointers.
Comment on attachment 10056 [details] Patch to namei.c for testing NULL pointers. PLEASE IGNORE THIS PATCH. CHOOSE THE OTHER, NEWER PATCH.
Created attachment 10057 [details] Patch to namei.c for testing NULL pointers. New patch actually tests lookup() for NULL.
I have this same oops, and reported it to lkml. Daniel Blueman (daniel dot blueman at gmail) suggested that 4KSTACKS might be triggering this. My kernel, being the stock RHEL kernel with XFS enabled, has 4KSTACKS enabled. I will try running without 4KSTACKS, but it's hard to trigger this oops.
Sorry this should have been closed ages ago - RHEL specific kernel bugs should go to the vendor and won't be handled here as the RHEL kernel has deviated quite a bit from upstream as it is an older kernel with backports