Bug 14453 - NULL pointer dereference on NFSv4 client
Status: CLOSED DUPLICATE of bug 14249
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P1 high
Assignee: Trond Myklebust
Reported: 2009-10-21 11:30 UTC by Harald Dunkel
Modified: 2009-12-17 07:58 UTC

Kernel Version: 2.6.31.4
Regression: No


Attachments
dmesg output of NFS client after the crash (53.32 KB, text/plain)
2009-10-21 11:30 UTC, Harald Dunkel
kernel config file of NFS client (95.66 KB, text/plain)
2009-10-21 11:44 UTC, Harald Dunkel

Description Harald Dunkel 2009-10-21 11:30:25 UTC
Created attachment 23488
dmesg output of NFS client after the crash

Hi folks,

While evaluating NFSv4, I stumbled upon this problem on one of the clients (running Lenny, kernel 2.6.31.4 amd64):

[77160.800016] nfs: server nasl002 not responding, timed out
[77220.816022] nfs: server nasl002 not responding, still trying
[77302.413985] nfs: server nasl002 OK
[77302.448731] BUG: unable to handle kernel NULL pointer dereference at 0000000000000205
[77302.452657] IP: [<0000000000000205>] 0x205
[77302.452657] PGD 7b107067 PUD 7b108067 PMD 0 
[77302.452657] Oops: 0010 [#1] SMP 
[77302.452657] last sysfs file: /sys/class/net/bond1/operstate
[77302.452657] CPU 1 
[77302.452657] Modules linked in: nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc sha256_generic sha1_generic cn battery bonding ipv6 loop snd_pcm snd_timer snd soundcore amd64_edac_mod psmouse serio_raw snd_page_alloc edac_core k8temp pcspkr processor button shpchp i2c_piix4 i2c_core pci_hotplug evdev joydev reiserfs raid10 raid456 raid6_pq async_xor async_memcpy async_tx xor raid1 raid0 multipath linear md_mod ide_pci_generic usbhid usb_storage hid sd_mod serverworks ide_core sata_svw ata_generic libata ohci_hcd ehci_hcd e1000 scsi_mod thermal fan thermal_sys [last unloaded: drbd]
[77302.452657] Pid: 3855, comm: rpciod/1 Not tainted 2.6.31.4 #1 To Be Filled By O.E.M.
[77302.452657] RIP: 0010:[<0000000000000205>]  [<0000000000000205>] 0x205
[77302.452657] RSP: 0018:ffff8800789c1e18  EFLAGS: 00010246
[77302.452657] RAX: ffff88007577a9b0 RBX: ffff88007577a980 RCX: ffff88006c07dcb8
[77302.452657] RDX: ffff88007577a980 RSI: ffff8800789c1e80 RDI: ffff8800789f10c8
[77302.828031] RBP: ffff8800789f10c8 R08: ffff8800789c0000 R09: ffff88007f043778
[77302.828031] R10: 0000000000000001 R11: 00000000000186a0 R12: 0000000000000000
[77302.828031] R13: ffff8800789f1158 R14: 0000000000000001 R15: ffffffffa03c5228
[77302.828031] FS:  00007f457fc886e0(0000) GS:ffff8800015c9000(0000) knlGS:0000000000000000
[77302.828031] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[77302.828031] CR2: 0000000000000205 CR3: 000000007b031000 CR4: 00000000000006e0
[77302.828031] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[77302.828031] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[77302.828031] Process rpciod/1 (pid: 3855, threadinfo ffff8800789c0000, task ffff88007f018f80)
[77302.828031] Stack:
[77302.828031]  ffffffffa03c5a1f ffff8800375f4d90 ffff8800789c1ef8 ffff8800789f10c8
[77302.828031] <0> ffffffffa03c4fd9 ffff8800789c1ef8 ffffc900012e04c0 ffff8800789f1170
[77302.828031] <0> ffff8800789f1178 ffff88007f018f80 ffffffff81052a06 ffffc900012e04d8
[77302.828031] Call Trace:
[77302.828031]  [<ffffffffa03c5a1f>] ? rpcauth_refreshcred+0x44/0x4f [sunrpc]
[77302.828031]  [<ffffffffa03c4fd9>] ? __rpc_execute+0x7d/0x240 [sunrpc]
[77302.828031]  [<ffffffff81052a06>] ? worker_thread+0x173/0x20f
[77302.828031]  [<ffffffff81056a0e>] ? autoremove_wake_function+0x0/0x2e
[77302.828031]  [<ffffffff81052893>] ? worker_thread+0x0/0x20f
[77302.828031]  [<ffffffff810566c0>] ? kthread+0x8b/0x93
[77302.828031]  [<ffffffff8100caea>] ? child_rip+0xa/0x20
[77302.828031]  [<ffffffff81056635>] ? kthread+0x0/0x93
[77302.828031]  [<ffffffff8100cae0>] ? child_rip+0x0/0x20
[77302.828031] Code:  Bad RIP value.
[77302.828031] RIP  [<0000000000000205>] 0x205
[77302.828031]  RSP <ffff8800789c1e18>
[77302.828031] CR2: 0000000000000205
[77303.612141] ---[ end trace d91eedf71dd4b980 ]---


% cat /proc/mounts
rootfs / rootfs rw 0 0
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev tmpfs rw,relatime,size=10240k,mode=755 0 0
/dev/disk/by-label/root / reiserfs rw,relatime,notail 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
usbfs /proc/bus/usb usbfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
rpc_pipefs /var/lib/rpc_pipefs rpc_pipefs rw,relatime 0 0
nasl002:/data/ /mnt nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=100,retrans=3,sec=sys,clientaddr=172.19.96.129,addr=172.19.96.213 0 0


For testing, I had stopped the NFS services on nasl002 (the server) for a few minutes while two clients were each running three kernel builds in parallel. After NFS came back, I got this oops on one of the clients.

So far I have seen this only once. The other client, in the same configuration and running the same test in parallel, did not have this problem.

The NFSv4 server is running 2.6.29.6, drbd8 and heartbeat 2.1.4. No Kerberos. Here is the /etc/exports:

/nfs4           172.19.96.0/23(rw,fsid=root,insecure,no_subtree_check,async)
/nfs4/data      172.19.96.0/23(rw,nohide,insecure,no_subtree_check,async)

The server's log files don't indicate any problem.


Please mail me if I can help to track this down.


Regards

Harri
Comment 1 Harald Dunkel 2009-10-21 11:44:20 UTC
Created attachment 23489
kernel config file of NFS client
Comment 2 Trond Myklebust 2009-10-23 19:15:50 UTC
This looks like a duplicate of bug 14249 (which is still unresolved).

*** This bug has been marked as a duplicate of bug 14249 ***
Comment 3 Harald Dunkel 2009-10-24 07:38:00 UTC
Are you sure that #14249 is the same problem? The stack traces look _very_ different. I am not using Kerberos, but in the other bug report the crash occurred in gss_validate().
Comment 4 Trond Myklebust 2009-10-24 14:32:32 UTC
They both appear to be use-after-free issues with the RPC credentials, so I strongly suspect that you are all seeing the same bug causing crashes in different parts of the code.

I'm pretty sure we haven't changed any of the RPC auth code in the 2.6.31 cycle, so I don't think that's where the bug is coming from. Rather it would be something in the NFS layer that is putting the reference count twice. I strongly suspect the NFSv4.1 merge here (in fact I believe I've found at least one bug that can explain it - see the patch that I posted in bug 14249)...

If it turns out that the patch I provided in bug 14249 fixes one case but not the other, then I'll unlink the bugs again and treat them as separate...
