Most recent kernel where this bug did *NOT* occur: 2.6.11.10
Distribution: RHEL4 update 3
Hardware Environment: HP DL585 (4x Opteron 848), 26 GB RAM, Compaq Smartarray 5i (cciss driver) for the root filesystem, QLogic 2312 HBA (qla2xxx driver) for the rest of the storage (an external array - IBM FAStT 600). A HW problem is unlikely, because we have recently replaced most of the system components (incl. the mainboard and the Smartarray).
Software Environment: The root filesystem is ext3; all other filesystems are XFS on LVM on the IBM array. The server has various tasks, including providing home directories for ~2200 users via NFS and Samba, mail server, etc.

Problem Description: After booting into a new kernel, usually within 1 day (often much faster, depending on the system load), I get the attached Oops. See also my year-and-a-half-old mail to LKML: http://lkml.org/lkml/2005/10/22/13 where I report this problem on the 2.6.14-rc4 kernel; it contains another oops dump. I am running 2.6.11.10, because newer kernels have this problem as well (including 2.6.20-rc7, which I tested yesterday). I occasionally try to upgrade from 2.6.11.10 to something newer, but have to go back because of this problem. The process in which the oops happens varies - this one is for rsync, but I have seen others (e.g. nfsd) as well. I am filing this under VFS, but it might also be XFS related.

The oops is:

Unable to handle kernel paging request at 00000000fffffff4 RIP:
 [<ffffffff80208dea>] __d_lookup+0x72/0x112
PGD 5cea88067 PUD 0
Oops: 0000 [46] SMP
CPU 2
Modules linked in: ohci_hcd usbcore i2c_amd756 i2c_core k8temp hwmon qla2xxx amd74xx ide_core
Pid: 32250, comm: rsync Not tainted 2.6.20-rc7 #96
RIP: 0010:[<ffffffff80208dea>]  [<ffffffff80208dea>] __d_lookup+0x72/0x112
RSP: 0018:ffff81012f09bbf8  EFLAGS: 00010202
RAX: 00000000fffffff4 RBX: ffff81067d184080 RCX: 0000000000000016
RDX: 0000000000117295 RSI: 018723826d117295 RDI: ffff8103b4ad7188
RBP: 00000000fffffff4 R08: ffff81012f09be48 R09: 0000000000000246
R10: 0000000000000246 R11: 0000000000000246 R12: ffff8103b4ad7188
R13: ffff81012f09bcc8 R14: 00000000bd3c3a0d R15: 0000000000000006
FS:  00002ad0394e23a0(0000) GS:ffff8104000ef640(0000) knlGS:00000000f6ecebb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000fffffff4 CR3: 00000005b8630000 CR4: 00000000000006e0
Process rsync (pid: 32250, threadinfo ffff81012f09a000, task ffff810138a7d850)
Stack: 0000000000000246 0000000000000000 ffff810578fc9015 0000000000000000
 ffff81012f09be48 ffff8103b4ad55e8 0000000000000000 ffff81012f09be48
 ffff81012f09bcc8 ffffffff8020c3ef ffff81012f09be48 ffff8104002ba580
Call Trace:
 [<ffffffff8020c3ef>] do_lookup+0x2a/0x1b8
 [<ffffffff8020971e>] __link_path_walk+0x894/0xccb
 [<ffffffff80299598>] zone_statistics+0x41/0x63
 [<ffffffff8020dc88>] link_path_walk+0x4c/0xc2
 [<ffffffff8020c25e>] do_path_lookup+0x1b0/0x20c
 [<ffffffff802200ac>] __user_walk_fd+0x37/0x4c
 [<ffffffff80238ce2>] vfs_lstat_fd+0x18/0x47
 [<ffffffff80226723>] sys_newlstat+0x19/0x31
 [<ffffffff8025357e>] system_call+0x7e/0x83
Code: 48 8b 45 00 0f 18 08 48 8d 5d e8 44 39 73 30 75 72 4c 39 63
RIP [<ffffffff80208dea>] __d_lookup+0x72/0x112
 RSP <ffff81012f09bbf8>
CR2: 00000000fffffff4
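A note on how I read the fault address: __d_lookup walks a dcache hash chain using container_of()-style pointer arithmetic on the chain links embedded in each dentry, so a single corrupted chain pointer turns into a dereference at a nearby bogus address - which is what the 00000000fffffff4 above looks like to me. The following is only a self-contained userspace model of that kind of walk (made-up type and function names, NOT the actual kernel code), included just to illustrate the mechanism:

/*
 * Stripped-down userspace model of a __d_lookup-style hash-chain walk.
 * Hypothetical types/names for illustration only - not the kernel source.
 */
#include <stdio.h>
#include <stddef.h>
#include <string.h>

struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

/* toy stand-in for struct dentry */
struct toy_dentry {
	unsigned int      hash;
	const char       *name;
	struct hlist_node d_hash;   /* chain link, embedded in the object */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* walk one hash chain, the way __d_lookup walks a dentry hash bucket */
static struct toy_dentry *lookup(struct hlist_head *head,
				 unsigned int hash, const char *name)
{
	struct hlist_node *node;

	for (node = head->first; node; node = node->next) {
		/* pointer arithmetic: object = node - offsetof(..., d_hash) */
		struct toy_dentry *d = container_of(node, struct toy_dentry, d_hash);

		/* if "node" is garbage, this dereference faults at an
		 * address near the garbage value, just like the oops above */
		if (d->hash != hash)
			continue;
		if (strcmp(d->name, name) == 0)
			return d;
	}
	return NULL;
}

int main(void)
{
	struct toy_dentry a = { .hash = 1, .name = "foo" };
	struct toy_dentry b = { .hash = 2, .name = "bar" };
	struct hlist_head head = { .first = &a.d_hash };

	a.d_hash.next = &b.d_hash;
	b.d_hash.next = NULL;

	struct toy_dentry *hit = lookup(&head, 2, "bar");
	printf("found: %s\n", hit ? hit->name : "(none)");
	return 0;
}

Compiled and run, this prints "found: bar"; the point is only that the object pointer is computed by subtracting a small offset from the chain pointer, so garbage in the chain faults at garbage-minus-a-small-offset rather than at an obviously NULL address.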
I _think_ it is XFS related - I have moved my file systems from XFS to ext3, and have not seen this bug since then (the server has now been up for two days on 2.6.21-rc4). I am changing the component to XFS.
Adding XFS and NFS maintainers to make sure they see the problem report. Thanks.
Can you describe the test case that trips this? rsync from where to where? Is NFS involved in the rsync? Any details you can provide would be helpful.... BTW, is this an x86_64 kernel?
Re: comment #3 - I don't have a test case; this happened during normal system load (postfix, NFS/Samba serving users' home directories, etc). I _think_ it was higher VFS load which triggered it. I used "find / -type f -print", but that was not always sufficient to trigger the oops. As for the original report: we are now reconsidering HW problems again, but we don't have any evidence to support this (like ECC errors or MCEs or something like that).
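To be concrete about what I mean by VFS load: the crashes always came out of path-lookup-heavy work (rsync, nfsd, find), i.e. lots of lstat() calls - sys_newlstat is right there in the backtrace. Something along the lines of the sketch below, run in a few parallel copies over the exported home directories, is roughly equivalent to the find command above (this is only an illustration, not a reliable reproducer):

/* Illustrative lstat() load generator, roughly what "find / -type f" exercises.
 * A sketch only - not a guaranteed way to hit the oops. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

static void walk(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *de;
	char path[4096];
	struct stat st;

	if (!d)
		return;
	while ((de = readdir(d)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (lstat(path, &st) < 0)   /* same syscall as in the oops backtrace */
			continue;
		if (S_ISDIR(st.st_mode))
			walk(path);
	}
	closedir(d);
}

int main(int argc, char **argv)
{
	walk(argc > 1 ? argv[1] : "/");
	return 0;
}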
Hi Jan, Can you provide your dmesg output? Some detail on your runtime environment would be good too, such as whether you run a 64-bit kernel, and whether the user space is 32 or 64 bit? Thanks, --Natalie
Natalie, the kernel is 64-bit (as you can see from the Oops dump - there are 64-bit values, and registers like R08-R15, which IA-32 does not have). The userspace is 64-bit as well - it is RHEL4 update 5 x86_64 now. I can attach the dmesg output (so that you can look at the hardware configuration), but unfortunately not the dmesg from the kernel which crashed with the reported Oops. I am now running (as requested by the HP hardware service) the stock RHEL4 kernel, kernel-2.6.9-55.EL, with netdump configured, so that they can debug this further in case the kernel crashes. But they are still considering HW problems. Is the dmesg from the RHEL4 kernel still interesting for you?
I guess the dmesg from RHEL 4 will be OK, and the results of the HW investigation would be great.
Created attachment 11736 [details]
dmesg from the RHEL4 kernel

Just to show the hardware and other properties of the system, I am attaching the dmesg output of the RHEL4 kernel from our DL585.
Jan, In #1 you mentioned that you were trying 2.6.21 with ext3, having moved off XFS on the suspicion that XFS might be at fault. What was the result of that test? Also, it is always best to try the latest kernel.org kernel, simply to check whether fixes have directly or indirectly helped and the problem has been resolved. Can you test with the latest as a checkpoint?
Jan, To make any progress on this, I really need a reproducible test case. My experience with tracking down this sort of problem is that it is almost impossible without being able to reproduce the bug at will and instrumenting the kernel to find out what is going wrong.... If that is not possible, I'd suggest turning on memory poisoning, slab debugging and other such runtime debug checks to see if there's a use-after-free type of problem here.... If neither of these two methods provides any results, then I cannot see how I can find whatever is going on here....
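Concretely, on a 2.6.20-era kernel that means rebuilding with something like the options below (names from memory - please double-check them in menuconfig for your exact version and architecture) and running the debug kernel under the usual load:

CONFIG_DEBUG_KERNEL=y
# slab red-zoning and object poisoning - catches many use-after-free cases
CONFIG_DEBUG_SLAB=y
# sanity checks on list manipulation (corrupted prev/next pointers)
CONFIG_DEBUG_LIST=y
# unmap pages on free so stale pointers fault immediately
# (may not be selectable on x86_64 in this era - check your tree)
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_SPINLOCK=y

The downside is a noticeable slowdown and higher memory use, but if there is a use-after-free in the dcache or in XFS, these checks should turn a silent corruption into an immediate, much more informative oops.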
Last weekend we upgraded our server to RHEL5 (the 2.6.18-8.1.8.el5 kernel). So far it has not crashed, but then the traffic during the summer holidays is not as heavy as during the semester. Moreover, we are waiting for replacement CPU/memory daughterboards from HP - they said the newer board revision _may_ fix a memory bug similar (but not identical) to our problem. We will see in a week or so.
Jan, any updates from you on this problem?
Since then the HP people have replaced the memory+CPU daughterboards (not the memory modules or CPUs themselves), and we are keeping our kernel up to date wrt. RHEL5 (currently 2.6.18-53.1.6.el5). We have not seen this problem since the HW upgrade, or maybe since the upgrade mentioned in comment #11. So maybe we can close this bug as UNREPRODUCIBLE or INVALID? It is possible that it was a HW problem after all.
Another note: after last week's vmsplice() security problem, I installed 2.6.24.2 on this server, and it still works without a problem (7+ days of uptime). So a HW problem is probably the correct explanation of my problem.
Great, thanks for following up.