Bug 14641 - cifs crash
Summary: cifs crash
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: CIFS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Shirish Pargaonkar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-11-19 15:23 UTC by Vladimir Stavrinov
Modified: 2009-12-18 20:02 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.30
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
kernel crash on cifs while net boot of kvm (178.93 KB, image/jpeg)
2009-11-19 15:31 UTC, Vladimir Stavrinov
Details
snapshot of 2.6.31 crash (185.57 KB, image/jpeg)
2009-11-19 16:21 UTC, Vladimir Stavrinov
Details
cifs module 2.6.31-1-amd64 (422.78 KB, application/octet-stream)
2009-11-19 16:23 UTC, Vladimir Stavrinov
Details
cifsFYI debug snapshot (192.57 KB, image/jpeg)
2009-11-19 18:49 UTC, Vladimir Stavrinov
Details
check nd before accessing its fields (754 bytes, patch)
2009-11-19 20:11 UTC, Shirish Pargaonkar
Details | Diff
check for nd before accessing its fields, in two places where it is missing (1.20 KB, patch)
2009-11-19 20:52 UTC, Shirish Pargaonkar
Details | Diff
cifs patch for 2.6.31 kernel (893 bytes, patch)
2009-11-20 19:43 UTC, Vladimir Stavrinov
Details | Diff

Description Vladimir Stavrinov 2009-11-19 15:23:53 UTC
See kvm screen dump of net boot process in attachment. This is amd64. The same picture for 686 and real hardware as well. Live initrd generated in chrooted environment serving as network root filesystem for diskless workstations. This bug continued from 2.6.30 and it is not exists in at least 2.6.26.
Comment 1 Vladimir Stavrinov 2009-11-19 15:31:30 UTC
Created attachment 23829 [details]
kernel crash on cifs while net boot of kvm
Comment 2 Jeff Layton 2009-11-19 15:44:13 UTC
Could you attach the cifs.ko from this kernel? I need to disassemble it to see where it crashed. Another thing that might be helpful is turning up cifsFYI and reproducing this crash. Instructions on how to do that are here:

http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting

It looks like you're using CIFS as part of an aufs stack? I have to wonder whether it's passing something unexpected to the lookup here.
Comment 3 Vladimir Stavrinov 2009-11-19 16:21:21 UTC
Created attachment 23830 [details]
snapshot of 2.6.31 crash

This is virtual machine, based on kvm module, booting from network. Root fs is cifs. initrd generated by scripts from debian-live project. It is using aufs.  The kernel itself and cifs module both are from binary debian package 
linux-image-2.6.31-1-amd64 version 2.6.31-2
Comment 4 Vladimir Stavrinov 2009-11-19 16:23:18 UTC
Created attachment 23831 [details]
cifs module 2.6.31-1-amd64

The crash of this module shown in previous snapshot
Comment 5 Vladimir Stavrinov 2009-11-19 16:30:06 UTC
(In reply to comment #2)

> where it crashed. Another thing that might be helpful is turning up cifsFYI
> and
> reproducing this crash. Instructions on how to do that are here:

How to do 

echo 7 > /proc/fs/cifs/cifsFYI

while boot process? The initrd scripts should be modified for this.
Comment 6 Vladimir Stavrinov 2009-11-19 18:49:40 UTC
Created attachment 23832 [details]
cifsFYI debug snapshot

(In reply to comment #5)

> How to do 
> 
> echo 7 > /proc/fs/cifs/cifsFYI
> 
> while boot process? The initrd scripts should be modified for this.

What is done:

1. add rdinit=/bin/sh parameter to kernel command line
2. When lending on initrd:
   - load cifs module
   - enable cifsFYI
   - edit scripts/live and add dmesg just after mount.cifs
3. And finally resume boot:
    exec /init 

The result You see on snapshot
Comment 7 Jeff Layton 2009-11-19 19:12:23 UTC
Thanks for the info, that helps. The code crashed in this instruction:

   1028d:       f6 42 31 04             testb  $0x4,0x31(%rdx)

...which basically says take the value at the address given by %rdx + 31 and then bitwise and it with 0x4. the problem is that %rdx holds 0 at this time, and that causes a null pointer dereference.

This code corresponds with this:

        /*
         * O_EXCL: optimize away the lookup, but don't hash the dentry. Let
         * the VFS handle the create.
         */
        if (nd->flags & LOOKUP_EXCL) {
                d_instantiate(direntry, NULL);
                return 0;
        }

...LOOKUP_EXCL == 0x4. This means that the 'nd' pointer passed into cifs_lookup was NULL. I don't think the VFS ever does this on its own, so I'm guessing that this must be an artifact of aufs. In any case, a cursory look at other lookup ops indicates that they assume that it's possible to get a NULL nameidata pointer, so I suppose we should account for that possibility here.

I'll see if I can get the attention of the person who added the code to do create on lookup. They may have an idea of how they want to fix it.
Comment 8 Jeff Layton 2009-11-19 19:13:43 UTC
Shirish/Steve,
  Any thoughts on how you want to approach fixing this?
Comment 9 Jeff Layton 2009-11-19 19:15:16 UTC
In a nutshell, the problem here is that cifs_lookup assumes that nd will be a non-NULL pointer, but it shouldn't.
Comment 10 Shirish Pargaonkar 2009-11-19 19:23:56 UTC
Looking
Comment 11 Shirish Pargaonkar 2009-11-19 19:29:14 UTC
We probably need to something like this

nd && nd->flags & LOOKUP_EXCL

wherever cifs checks nd flags
Comment 12 Shirish Pargaonkar 2009-11-19 20:11:59 UTC
Created attachment 23834 [details]
check nd before accessing its fields

Fix the oops by checking whether nd (nameidata) is NULL or not before
accessing its fields
Comment 13 Jeff Layton 2009-11-19 20:17:23 UTC
What about the other places where nd is referenced in that function?
Comment 14 Shirish Pargaonkar 2009-11-19 20:22:31 UTC
Jeff, nd is checked before accessing in the rest of the places.
Comment 15 Jeff Layton 2009-11-19 20:32:20 UTC
That doesn't appear to be the case in the code I'm looking at:

        if (pTcon->unix_ext) {
                if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY)) &&
                     (nd->flags & LOOKUP_OPEN) && !pTcon->broken_posix_open &&
                     (nd->intent.open.flags & O_CREAT)) {
                        rc = cifs_posix_open(full_path, &newInode, nd->path.mnt,
                                        nd->intent.open.create_mode,
                                        nd->intent.open.flags, &oplock,
                                        &fileHandle, xid);

...if LOOKUP_EXCL isn't set but unix extensions are enabled, then you'll end up dereferencing nd->flags again and oopsing.
Comment 16 Shirish Pargaonkar 2009-11-19 20:52:22 UTC
Created attachment 23835 [details]
check for nd before accessing its fields, in two places where it is missing

Jeff, you are right, added check for nameidata before accessing its fields
in two places now, instead of just one as in previous patch.
Comment 17 Jeff Layton 2009-11-19 20:59:32 UTC
That looks like it'll fix it.

Vladimir, would you be able to test the patch in comment #16 and let us know if it resolves the problem?
Comment 18 Vladimir Stavrinov 2009-11-20 19:43:25 UTC
Created attachment 23847 [details]
cifs patch for 2.6.31 kernel

I have slightly modified Your patch to adopt for kernel 2.6.31. After applying and compiling the kernel booted successfully. Thanks.
Comment 19 Jeff Layton 2009-11-21 16:37:20 UTC
Reassigning to Shirish...
Comment 20 Vladimir Stavrinov 2009-12-18 15:49:12 UTC
Sorry, but there are bad news. I didn't notice this new bug at the first time testing, but now it is look like if it is originating from this patch. In any way the fact is that after successful booting I see some duplicated directories: /etc, /usr and /var at the root level. But there are some others in the depth. This have some side effects. For example, the gdm chooser can not find any hosts on the net. This bug appear in the my patched 2.6.31 kernel and new one recently came the 2.6.32. If You think it is aufs related bug, than You should take in account that there are no such problem with the same kernels and live cd, because there are no cifs.
Comment 21 Shirish Pargaonkar 2009-12-18 15:52:45 UTC
Vladimir, can you please open a separate bug?  This bug was opened to report
oops, which is fixed (is that correct?), so I think a separate bug to report a separate bug would be correct way to go.
Comment 22 Jeff Layton 2009-12-18 16:12:28 UTC
Agreed, please open a new bug for this. I suspect though that you have a semi-broken CIFS server here. You may want to mount with "noserverino" and see if that works around the issue. Please state whether that helps when you report the bug.
Comment 23 Vladimir Stavrinov 2009-12-18 20:02:54 UTC
You see the new opened bug:
http://bugzilla.kernel.org/show_bug.cgi?id=14833

The "noserverino" option dont' help

CIFS server is not broken - there no such problem with 2.6.26 kernel

Note You need to log in before you can comment on or make changes to this bug.