Bug 7929 - Oops in __d_lookup (sys_lstat() call path)
Status: REJECTED INVALID
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: i386 Linux
Importance: P2 normal
Assignee: Dave Chinner
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-02-03 12:23 UTC by Jan "Yenya" Kasprzak
Modified: 2008-02-18 06:01 UTC
CC List: 6 users

See Also:
Kernel Version: 2.6.14-rc4, 2.6.20-rc7
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmesg from the RHEL4 kernel (20.10 KB, text/plain)
2007-06-12 00:42 UTC, Jan "Yenya" Kasprzak

Description Jan "Yenya" Kasprzak 2007-02-03 12:23:01 UTC
Most recent kernel where this bug did *NOT* occur: 2.6.11.10

Distribution: RHEL4 update 3

Hardware Environment: HP DL585 (4x Opteron 848), 26 GB RAM, Compaq Smartarray 5i
(cciss driver) for the root filesystem, QLogic 2312 HBA (qla2xxx driver) for the
rest of storage (an external array - IBM FAStT 600). HW problem is unlikely,
because we have recently replaced most of the system components (incl. mainboard
and the Smartarray)

Software Environment: The root filesystem is ext3, all other filesystems are XFS
on LVM on the IBM array. The server has various tasks, including providing home
directories for ~2200 users via NFS and Samba, mail server, etc.

Problem Description:
After booting to a new kernel, usually within 1 day (often much faster,
depending on the system load), I get the attached Oops. 

See also my year-and-a-half-old mail to LKML: http://lkml.org/lkml/2005/10/22/13
where I reported this problem on the 2.6.14-rc4 kernel. It contains another oops
dump. I am running 2.6.11.10 because newer kernels have this problem as well
(including 2.6.20-rc7, which I tested yesterday). I occasionally try to upgrade
from 2.6.11.10 to something newer, but have to go back because of this problem.
 
The process in which the oops happens varies - this one is for rsync, but I have
seen others (e.g. nfsd) as well.

I am filing this under VFS, but it might also be XFS related.

The oops is:

Unable to handle kernel paging request at 00000000fffffff4 RIP: 
 [<ffffffff80208dea>] __d_lookup+0x72/0x112
PGD 5cea88067 PUD 0 
Oops: 0000 [46] SMP 
CPU 2 
Modules linked in: ohci_hcd usbcore i2c_amd756 i2c_core k8temp hwmon qla2xxx
amd74xx ide_core
Pid: 32250, comm: rsync Not tainted 2.6.20-rc7 #96
RIP: 0010:[<ffffffff80208dea>]  [<ffffffff80208dea>] __d_lookup+0x72/0x112
RSP: 0018:ffff81012f09bbf8  EFLAGS: 00010202
RAX: 00000000fffffff4 RBX: ffff81067d184080 RCX: 0000000000000016
RDX: 0000000000117295 RSI: 018723826d117295 RDI: ffff8103b4ad7188
RBP: 00000000fffffff4 R08: ffff81012f09be48 R09: 0000000000000246
R10: 0000000000000246 R11: 0000000000000246 R12: ffff8103b4ad7188
R13: ffff81012f09bcc8 R14: 00000000bd3c3a0d R15: 0000000000000006
FS:  00002ad0394e23a0(0000) GS:ffff8104000ef640(0000) knlGS:00000000f6ecebb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000fffffff4 CR3: 00000005b8630000 CR4: 00000000000006e0
Process rsync (pid: 32250, threadinfo ffff81012f09a000, task ffff810138a7d850)
Stack:  0000000000000246 0000000000000000 ffff810578fc9015 0000000000000000
 ffff81012f09be48 ffff8103b4ad55e8 0000000000000000 ffff81012f09be48
 ffff81012f09bcc8 ffffffff8020c3ef ffff81012f09be48 ffff8104002ba580
Call Trace:
 [<ffffffff8020c3ef>] do_lookup+0x2a/0x1b8
 [<ffffffff8020971e>] __link_path_walk+0x894/0xccb
 [<ffffffff80299598>] zone_statistics+0x41/0x63
 [<ffffffff8020dc88>] link_path_walk+0x4c/0xc2
 [<ffffffff8020c25e>] do_path_lookup+0x1b0/0x20c
 [<ffffffff802200ac>] __user_walk_fd+0x37/0x4c
 [<ffffffff80238ce2>] vfs_lstat_fd+0x18/0x47
 [<ffffffff80226723>] sys_newlstat+0x19/0x31
 [<ffffffff8025357e>] system_call+0x7e/0x83


Code: 48 8b 45 00 0f 18 08 48 8d 5d e8 44 39 73 30 75 72 4c 39 63 
RIP  [<ffffffff80208dea>] __d_lookup+0x72/0x112
 RSP <ffff81012f09bbf8>
CR2: 00000000fffffff4
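
As a reading aid for the dump above, here is a minimal Python sketch that parses
a few of the register lines and checks which registers hold the faulting address
(CR2). The helper names are illustrative only. The signed 32-bit interpretation
at the end is plain arithmetic, not a diagnosis.

```python
import re

# Excerpt of the register dump from the oops above (copied verbatim).
OOPS_EXCERPT = """\
RAX: 00000000fffffff4 RBX: ffff81067d184080 RCX: 0000000000000016
RBP: 00000000fffffff4 R08: ffff81012f09be48 R09: 0000000000000246
R13: ffff81012f09bcc8 R14: 00000000bd3c3a0d R15: 0000000000000006
CR2: 00000000fffffff4 CR3: 00000005b8630000 CR4: 00000000000006e0
"""

def parse_registers(text):
    """Return {register name: integer value} for every 'NAME: hex' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}

def registers_holding(regs, address):
    """List non-control registers whose value equals the given address."""
    return sorted(n for n, v in regs.items()
                  if v == address and not n.startswith("CR"))

def as_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low

regs = parse_registers(OOPS_EXCERPT)
fault = regs["CR2"]
print("fault address:", hex(fault))
print("registers holding it:", registers_holding(regs, fault))
print("upper 32 bits zero:", fault >> 32 == 0)
print("low 32 bits as signed:", as_signed32(fault))
```

The faulting address sits in both RAX and RBP and has its upper 32 bits clear,
unlike the other pointers in the dump, which start with ffff81.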
Comment 1 Jan "Yenya" Kasprzak 2007-03-26 12:47:07 UTC
I _think_ it is XFS related - I have moved my file systems from XFS to ext3, and
have not seen this bug since then (the server has now been up for two days on
2.6.21-rc4). I am changing the component to XFS.
Comment 2 Natalie Protasevich 2007-05-23 14:11:19 UTC
Adding XFS and NFS maintainers to make sure they saw the problem report.
Thanks.
Comment 3 Dave Chinner 2007-05-23 18:31:45 UTC
Can you describe the test case that trips this?
rsync from where to where? Is NFS involved in the rsync? Any details
you can provide would be helpful....

BTW, is this an x86_64 kernel?
Comment 4 Jan "Yenya" Kasprzak 2007-05-24 03:06:27 UTC
Re: comment #3

I don't have a test case - this happened under normal system load
(postfix, NFS/Samba serving users' home directories, etc). I _think_ it was
higher VFS load that triggered the oops. I used "find / -type f -print" to
generate load, but it was not always sufficient to trigger the oops.
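
A load generator in the same spirit as the find command can be sketched in
Python. This is a sketch only, not a known reproducer: it walks a tree and calls
lstat() on every entry, which exercises path lookups similar to the
sys_newlstat path in the call trace. The function name lstat_tree is
illustrative, and the root directory to walk is left to the caller.

```python
import os
import sys

def lstat_tree(root):
    """Call lstat() on every entry below root; return how many calls succeeded."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                # Entries can vanish on a live server; skip them.
                pass
    return count

if __name__ == "__main__":
    # Walk the directory given on the command line, or the current one.
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    print(lstat_tree(start))
```

Running several copies in parallel on different subtrees would raise the lookup
rate, although nothing in this report shows that doing so triggers the oops.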

As for the original report: we are now reconsidering HW problems again,
but we don't have any evidence to support this (like ECC errors or MCEs or
something like that).
Comment 5 Natalie Protasevich 2007-06-04 15:55:09 UTC
Hi Jan,
Can you provide your dmesg output? Some detail on your runtime environment would
be good too, such as whether you run a 64-bit kernel, and whether the user space
is 32 or 64 bit.
Thanks,
--Natalie
Comment 6 Jan "Yenya" Kasprzak 2007-06-05 05:05:16 UTC
Natalie, the kernel is 64-bit (as you can see from the Oops dump - there are
64-bit values, and registers like R08-R15, which IA-32 does not have).
The userspace as well - it is RHEL4 update 5 x86_64 now.
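
The two hints mentioned above (64-bit values and the R08-R15 registers) can be
checked mechanically. The following is a minimal sketch of that heuristic. The
function name is illustrative, and the i386-style line in the example is a
made-up sample for contrast, not taken from any real dump.

```python
import re

def looks_like_x86_64(oops_text):
    """Heuristic: x86_64 dumps name R08..R15 and print 16-digit hex values."""
    has_extended_regs = re.search(r"\bR(0[89]|1[0-5]):", oops_text) is not None
    has_64bit_values = re.search(r"\b[0-9a-f]{16}\b", oops_text) is not None
    return has_extended_regs and has_64bit_values

# A line copied from the oops in this report.
reported = "RBP: 00000000fffffff4 R08: ffff81012f09be48 R09: 0000000000000246"
# Hypothetical 32-bit style line, for contrast only.
hypothetical_i386 = "eax: 00000000   ebx: c0123456   ecx: 00000016"

print(looks_like_x86_64(reported))
print(looks_like_x86_64(hypothetical_i386))
```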

I can attach the dmesg output (so that you can look at the hardware
configuration), but unfortunately not the dmesg from the kernel which crashed
with the reported Oops. I am now running (as requested by the HP hardware
service) the stock RHEL4 kernel (kernel-2.6.9-55.EL) with netdump configured, so
that they can debug this further in case the kernel crashes. But they are still
considering HW problems.

Is the dmesg from the RHEL4 kernel still interesting for you?
Comment 7 Natalie Protasevich 2007-06-11 09:53:31 UTC
I guess the dmesg from RHEL 4 will be OK, and the results of the HW 
investigation would be great.

Comment 8 Jan "Yenya" Kasprzak 2007-06-12 00:42:02 UTC
Created attachment 11736 [details]
dmesg from the RHEL4 kernel

Just to see the hardware and other properties of the system, I am attaching the dmesg output of the RHEL4 kernel from our DL585.
Comment 9 Natalie Protasevich 2007-06-20 01:40:54 UTC
Jan,
In #1 you mentioned that you were trying 2.6.21 with ext3 and were going back to xfs on the suspicion that xfs might fail. What was the result of that test?
Also, it is always best to try the latest kernel.org kernel, simply to check whether fixes have directly or indirectly resolved the problem. Can you test with the latest as a checkpoint?
Comment 10 Dave Chinner 2007-07-25 18:03:36 UTC
Jan,

To make any progress on this, I really need a reproducible test case.
My experience with tracking down this sort of problem is that it is almost
impossible without being able to reproduce it at will and instrumenting
the kernel to find out what is going wrong....

If that is not possible, I'd suggest turning on memory poisoning, slab
debug and other such runtime debug checks to see if there's a
use-after-free type of problem here....
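
Before booting a debug kernel, its .config can be checked for such options. The
sketch below assumes the option names CONFIG_DEBUG_SLAB and
CONFIG_DEBUG_PAGEALLOC. Treat them as assumptions and verify them against the
Kconfig of the kernel version being built, since availability varies by version
and architecture. The function name and the sample config text are illustrative.

```python
# Option names are assumptions; verify them against the Kconfig of the
# kernel being built.
WANTED = ("CONFIG_DEBUG_SLAB", "CONFIG_DEBUG_PAGEALLOC")

def missing_debug_options(config_text, wanted=WANTED):
    """Return the wanted options that are not set to 'y' in a .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as '# CONFIG_FOO is not set' and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return [opt for opt in wanted if opt not in enabled]

# Illustrative .config fragment, not taken from the reporter's system.
sample = """\
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
# CONFIG_DEBUG_PAGEALLOC is not set
"""
print(missing_debug_options(sample))
```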

If neither of these two methods provides any results, then I cannot see how
I can find whatever is going on here....
Comment 11 Jan "Yenya" Kasprzak 2007-08-30 01:43:04 UTC
Last weekend we upgraded our server to RHEL5 (2.6.18-8.1.8.el5 kernel). So far it has not crashed, but then the traffic during the summer holidays is not as heavy as during the semester. Moreover, we are waiting for replacement CPU/memory daughterboards from HP - they said the newer board revision _may_ fix a memory bug similar (but not identical) to our problem. We will see in a week or so.
Comment 12 Natalie Protasevich 2008-02-02 02:05:06 UTC
Jan, any updates from you on this problem?
Comment 13 Jan "Yenya" Kasprzak 2008-02-07 06:14:11 UTC
Since then HP people have replaced the memory+CPU daughterboards (not the memory
modules or CPUs themselves), and we are keeping our kernel up-to-date wrt. RHEL5 (currently at 2.6.18-53.1.6.el5), and we have not seen this problem since the HW upgrade or maybe since the upgrade from the comment #11.

So maybe we can close this bug as UNREPRODUCIBLE or INVALID? It is possible that it was a HW problem after all.
Comment 14 Jan "Yenya" Kasprzak 2008-02-18 04:36:14 UTC
Another note: after last week's vmsplice() security problem, I installed 2.6.24.2 on this server, and it still works without a problem (7+ days). So a HW problem is probably the correct answer to my problem.
Comment 15 Natalie Protasevich 2008-02-18 06:01:33 UTC
Great, thanks for following up.
