See kvm screen dump of net boot process in attachment. This is amd64. The same picture for 686 and real hardware as well. Live initrd generated in chrooted environment serving as network root filesystem for diskless workstations. This bug continued from 2.6.30 and it is not exists in at least 2.6.26.
Created attachment 23829 [details] kernel crash on cifs while net boot of kvm
Could you attach the cifs.ko from this kernel? I need to disassemble it to see where it crashed. Another thing that might be helpful is turning up cifsFYI and reproducing this crash. Instructions on how to do that are here: http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting It looks like you're using CIFS as part of an aufs stack? I have to wonder whether it's passing something unexpected to the lookup here.
Created attachment 23830 [details] snapshot of 2.6.31 crash This is virtual machine, based on kvm module, booting from network. Root fs is cifs. initrd generated by scripts from debian-live project. It is using aufs. The kernel itself and cifs module both are from binary debian package linux-image-2.6.31-1-amd64 version 2.6.31-2
Created attachment 23831 [details] cifs module 2.6.31-1-amd64 The crash of this module shown in previous snapshot
(In reply to comment #2) > where it crashed. Another thing that might be helpful is turning up cifsFYI > and > reproducing this crash. Instructions on how to do that are here: How to do echo 7 > /proc/fs/cifs/cifsFYI while boot process? The initrd scripts should be modified for this.
Created attachment 23832 [details] cifsFYI debug snapshot (In reply to comment #5) > How to do > > echo 7 > /proc/fs/cifs/cifsFYI > > while boot process? The initrd scripts should be modified for this. What is done: 1. add rdinit=/bin/sh parameter to kernel command line 2. When lending on initrd: - load cifs module - enable cifsFYI - edit scripts/live and add dmesg just after mount.cifs 3. And finally resume boot: exec /init The result You see on snapshot
Thanks for the info, that helps. The code crashed in this instruction: 1028d: f6 42 31 04 testb $0x4,0x31(%rdx) ...which basically says take the value at the address given by %rdx + 31 and then bitwise and it with 0x4. the problem is that %rdx holds 0 at this time, and that causes a null pointer dereference. This code corresponds with this: /* * O_EXCL: optimize away the lookup, but don't hash the dentry. Let * the VFS handle the create. */ if (nd->flags & LOOKUP_EXCL) { d_instantiate(direntry, NULL); return 0; } ...LOOKUP_EXCL == 0x4. This means that the 'nd' pointer passed into cifs_lookup was NULL. I don't think the VFS ever does this on its own, so I'm guessing that this must be an artifact of aufs. In any case, a cursory look at other lookup ops indicates that they assume that it's possible to get a NULL nameidata pointer, so I suppose we should account for that possibility here. I'll see if I can get the attention of the person who added the code to do create on lookup. They may have an idea of how they want to fix it.
Shirish/Steve, Any thoughts on how you want to approach fixing this?
In a nutshell, the problem here is that cifs_lookup assumes that nd will be a non-NULL pointer, but it shouldn't.
Looking
We probably need to something like this nd && nd->flags & LOOKUP_EXCL wherever cifs checks nd flags
Created attachment 23834 [details] check nd before accessing its fields Fix the oops by checking whether nd (nameidata) is NULL or not before accessing its fields
What about the other places where nd is referenced in that function?
Jeff, nd is checked before accessing in the rest of the places.
That doesn't appear to be the case in the code I'm looking at: if (pTcon->unix_ext) { if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY)) && (nd->flags & LOOKUP_OPEN) && !pTcon->broken_posix_open && (nd->intent.open.flags & O_CREAT)) { rc = cifs_posix_open(full_path, &newInode, nd->path.mnt, nd->intent.open.create_mode, nd->intent.open.flags, &oplock, &fileHandle, xid); ...if LOOKUP_EXCL isn't set but unix extensions are enabled, then you'll end up dereferencing nd->flags again and oopsing.
Created attachment 23835 [details] check for nd before accessing its fields, in two places where it is missing Jeff, you are right, added check for nameidata before accessing its fields in two places now, instead of just one as in previous patch.
That looks like it'll fix it. Vladimir, would you be able to test the patch in comment #16 and let us know if it resolves the problem?
Created attachment 23847 [details] cifs patch for 2.6.31 kernel I have slightly modified Your patch to adopt for kernel 2.6.31. After applying and compiling the kernel booted successfully. Thanks.
Reassigning to Shirish...
Sorry, but there are bad news. I didn't notice this new bug at the first time testing, but now it is look like if it is originating from this patch. In any way the fact is that after successful booting I see some duplicated directories: /etc, /usr and /var at the root level. But there are some others in the depth. This have some side effects. For example, the gdm chooser can not find any hosts on the net. This bug appear in the my patched 2.6.31 kernel and new one recently came the 2.6.32. If You think it is aufs related bug, than You should take in account that there are no such problem with the same kernels and live cd, because there are no cifs.
Vladimir, can you please open a separate bug? This bug was opened to report oops, which is fixed (is that correct?), so I think a separate bug to report a separate bug would be correct way to go.
Agreed, please open a new bug for this. I suspect though that you have a semi-broken CIFS server here. You may want to mount with "noserverino" and see if that works around the issue. Please state whether that helps when you report the bug.
You see the new opened bug: http://bugzilla.kernel.org/show_bug.cgi?id=14833 The "noserverino" option dont' help CIFS server is not broken - there no such problem with 2.6.26 kernel