Bug 7929
Summary: | Oops in __d_lookup (sys_lstat() call path) | ||
---|---|---|---|
Product: | File System | Reporter: | Jan "Yenya" Kasprzak (kas) |
Component: | XFS | Assignee: | Dave Chinner (dgc) |
Status: | REJECTED INVALID | ||
Severity: | normal | CC: | cw, neilb, protasnb, trondmy, xfs-masters, zhseal0 |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.14-rc4, 2.6.20-rc7 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | dmesg from the RHEL4 kernel |
Description
Jan "Yenya" Kasprzak
2007-02-03 12:23:01 UTC
I _think_ it is XFS related - I have moved my file systems from XFS to ext3, and have not seen this bug since then (the server has now been up for two days on 2.6.21-rc4). I am changing the component to XFS.

Adding XFS and NFS maintainers to make sure they saw the problem report. Thanks.

Can you describe the test case that trips this? rsync from where to where? Is NFS involved in the rsync? Any details you can provide would be helpful.... BTW, is this an x86_64 kernel?

Re: comment #3 I don't have a test case - this happened during normal system load (postfix, NFS/Samba serving users' home directories, etc). I _think_ it was higher VFS load which triggered this message. I used "find / -type f -print", but it was not always sufficient to trigger this oops. As for the original report: we are now reconsidering HW problems again, but we don't have any evidence to support this (like ECC errors or MCEs or something like that).

Hi Jan, Can you provide your dmesg output? Some detail on your runtime environment would be good too, such as whether you run a 64-bit kernel, and also whether the user space is 32 or 64 bit. Thanks, --Natalie

Natalie, the kernel is 64-bit (as you can see from the Oops dump - there are 64-bit values, and registers like R08-R15, which IA-32 does not have). The userspace as well - it is RHEL4 update 5 x86_64 now. I can attach the dmesg output (so that you can look at the hardware configuration), but unfortunately not the dmesg from the kernel which crashed with the reported Oops. As requested by the HP hardware service, I am now running the stock RHEL4 kernel kernel-2.6.9-55.EL with netdump configured, so that they can debug this further in case the kernel crashes. But they are still considering HW problems. Is the dmesg from the RHEL4 kernel still interesting for you?

I guess the dmesg from RHEL 4 will be OK, and the results of the HW investigation would be great.

Created attachment 11736
dmesg from the RHEL4 kernel
Just to see the hardware and other properties of the system, I am attaching the dmesg output of the RHEL4 kernel from our DL585.

Jan, In #1 you mentioned that you were trying 2.6.21 with ext3 and were going back to XFS on the suspicion that XFS might fail. What was the result of that test? Also, it is always best to try the latest kernel.org kernel, simply to check whether fixes have directly or indirectly resolved the problem. Can you test with the latest kernel as a checkpoint?

Jan, To make any progress on this, I really need a reproducible test case. My experience with tracking down this sort of problem is that it is almost impossible without being able to reproduce it at will and instrumenting the kernel to find out what is going wrong.... If that is not possible, I'd suggest turning on memory poisoning, slab debug and other such runtime debug checks to see if there's a use-after-free type of problem here.... If neither of these two methods provides any results, then I cannot see how I can find whatever is going on here....

Last weekend we upgraded our server to RHEL5 (2.6.18-8.1.8.el5 kernel). So far it has not crashed, but then the traffic during the summer holidays is not as big as during the semester. Moreover, we are waiting for replacement CPU/memory daughterboards from HP - they said the newer board revision _may_ fix a memory bug similar (but not identical) to our problem. We will see in a week or so.

Jan, any updates from you on this problem?

Since then HP people have replaced the memory+CPU daughterboards (not the memory modules or CPUs themselves), and we are keeping our kernel up-to-date wrt. RHEL5 (currently at 2.6.18-53.1.6.el5). We have not seen this problem since the HW upgrade, or maybe since the upgrade from comment #11. So maybe we can close this bug as UNREPRODUCIBLE or INVALID? It is possible that it was a HW problem after all.

Another note: after last week's vmsplice() security problem, I installed 2.6.24.2 on this server, and it still works without a problem (7+ days). So a HW problem is probably the correct explanation.

Great, thanks for following up.