Bug 13375 - Kernel crash with 2.6.29 + nfs + xfs (radix-tree)
Summary: Kernel crash with 2.6.29 + nfs + xfs (radix-tree)
Status: CLOSED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: XFS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Christoph Hellwig
URL:
Keywords:
Depends on:
Blocks: 12398
  Show dependency tree
 
Reported: 2009-05-24 21:26 UTC by Rafael J. Wysocki
Modified: 2012-01-25 21:57 UTC (History)
17 users (show)

See Also:
Kernel Version: 2.6.29
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Potential fix in xfs_vnodeops.c (604 bytes, patch)
2009-06-12 18:39 UTC, Stephane
Details | Diff
Screenshot of same problem (22.95 KB, image/png)
2009-06-17 08:39 UTC, Tommy
Details
My 2.6.30 config (9.56 KB, application/octet-stream)
2009-07-20 19:12 UTC, Nickolay
Details
call trace screen (30.37 KB, image/gif)
2009-07-25 14:50 UTC, Nickolay
Details
Various iget fixes (19.49 KB, patch)
2009-08-03 00:32 UTC, Christoph Hellwig
Details | Diff
Iget fixes, refreshed (18.75 KB, patch)
2009-08-03 11:53 UTC, Christoph Hellwig
Details | Diff
Iget fixes rollup for Linux 2.6.30 (18.52 KB, patch)
2009-08-03 18:11 UTC, Christoph Hellwig
Details | Diff
export the right symbol (449 bytes, patch)
2009-08-03 21:06 UTC, Christoph Hellwig
Details | Diff
2.6.30-stable backport patch series (4.83 KB, application/x-compressed-tar)
2009-08-16 02:42 UTC, Christoph Hellwig
Details
2.6.29-stable backport patch series (4.54 KB, application/x-compressed-tar)
2009-08-16 04:05 UTC, Christoph Hellwig
Details
screenshot (405.01 KB, image/jpeg)
2012-01-25 21:54 UTC, Ferenc Wágner
Details
The config going with attachment 72194 (62.12 KB, application/octet-stream)
2012-01-25 21:57 UTC, Ferenc Wágner
Details

Description Rafael J. Wysocki 2009-05-24 21:26:58 UTC
Subject    : Kernel crash with 2.6.29 + nfs + xfs (radix-tree)
Submitter  : Alex Samad <alex@samad.com.au>
Date       : 2009-05-20 0:37
References : http://marc.info/?l=linux-kernel&m=124278675503699&w=4
Notify-Also : Dave Chinner <david@fromorbit.com>

This entry is being used for tracking a regression from 2.6.28.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Andrew Randrianasulu 2009-05-25 22:55:10 UTC
Thanks for bringing this up, i has similar looking bug two days ago:


---------------

May 17 01:55:37 (none) kernel: reconnect_path: npd != pd                                                                           
May 17 01:55:56 (none) last message repeated 527 times                                                                             
May 17 01:58:28 (none) kernel: ------------[ cut here ]------------                                                                
May 17 01:58:28 (none) kernel: kernel BUG at lib/radix-tree.c:485!                                                                 
May 17 01:58:28 (none) kernel: invalid opcode: 0000 [#1] SMP                                                                       
May 17 01:58:28 (none) kernel: last sysfs file: /sys/devices/platform/w83627hf.656/in8_input                                       
May 17 01:58:28 (none) kernel: Modules linked in: radeon drm ipv6 sg sd_mod snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq sn
d_pcm_oss snd_mixer_oss nfsd lockd nfs_acl auth_rpcgss sunrpc usb_storage scsi_mod libusual tuner_simple tuner_types tda9887 tda829
0 snd_cs4232 snd_mpu401 snd_emu10k1 snd_wavefront snd_via82xx snd_ac97_codec snd_wss_lib saa7134 ac97_bus snd_pcm snd_opl3_lib uhci
_hcd ir_common ehci_hcd snd_timer videobuf_dma_sg snd_page_alloc shpchp snd_mpu401_uart videobuf_core snd_util_mem usbcore tveeprom
 snd_hwdep pci_hotplug ppdev snd_rawmidi snd_seq_device emu10k1_gp rtc_cmos 8139too rtc_core via_agp snd rtc_lib ns558 gameport sou
ndcore parport_pc parport xfs exportfs ufs agpgart tuner v4l2_common videodev v4l1_compat w83627hf hwmon_vid hwmon i2c_viapro i2c_d
ev                                                                                                                                 
May 17 01:58:28 (none) kernel:                                                                                                     
May 17 01:58:28 (none) kernel: Pid: 199, comm: kswapd0 Not tainted (2.6.29.3-i486 #2) MS-6380E                                     
May 17 01:58:28 (none) kernel: EIP: 0060:[<c10ff24b>] EFLAGS: 00010246 CPU: 0                                                      
May 17 01:58:28 (none) kernel: EIP is at radix_tree_tag_set+0x6e/0xa8                                                              
May 17 01:58:28 (none) kernel: EAX: cf546928 EBX: 00000024 ECX: 00000024 EDX: 00000000                                             
May 17 01:58:28 (none) kernel: ESI: 00000000 EDI: 00000001 EBP: cfa7fe80 ESP: cfa7fe60                                             
May 17 01:58:28 (none) kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068                                                       
May 17 01:58:28 (none) kernel: Process kswapd0 (pid: 199, ti=cfa7f000 task=cfa80f80 task.ti=cfa7f000)                              
May 17 01:58:28 (none) kernel: Stack:                                                                                              
May 17 01:58:28 (none) kernel:  00000000 01e1dd64 cf03ba58 00000000 00000000 c9311b80 cf2da000 cf03ba20                            
May 17 01:58:28 (none) kernel:  cfa7fe9c d0abbc48 c9311c24 cf03ba54 c9311ca4 c9311b80 0000006c cfa7feac

==========
and another one


May 24 08:48:00 (none) last message repeated 61 times                                                                              
May 24 08:48:47 (none) last message repeated 760 times                                                                             
May 24 08:50:55 (none) kernel: reconnect_path: npd != pd                                                                           
May 24 08:50:55 (none) last message repeated 9 times                                                                               
May 24 08:55:04 (none) kernel: reconnect_path: npd != pd                                                                           
May 24 08:56:05 (none) last message repeated 47 times                                                                              
May 24 08:56:49 (none) last message repeated 419 times                                                                             
May 24 09:19:39 (none) kernel: ------------[ cut here ]------------                                                                
May 24 09:19:39 (none) kernel: kernel BUG at lib/radix-tree.c:473!                                                                 
May 24 09:19:39 (none) kernel: invalid opcode: 0000 [#1] SMP                                                                       
May 24 09:19:39 (none) kernel: last sysfs file: /sys/devices/platform/w83627hf.656/in8_input                                       
May 24 09:19:39 (none) kernel: Modules linked in: radeon drm ipv6 sg sd_mod snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq sn
d_pcm_oss snd_mixer_oss nfsd lockd nfs_acl auth_rpcgss sunrpc usb_storage scsi_mod libusual tuner_simple tuner_types tda9887 snd_cs
4232 snd_mpu401 tda8290 snd_emu10k1 snd_wavefront snd_via82xx snd_ac97_codec snd_wss_lib saa7134 ac97_bus uhci_hcd snd_pcm ir_commo
n ehci_hcd snd_opl3_lib videobuf_dma_sg snd_page_alloc snd_timer snd_util_mem snd_mpu401_uart videobuf_core usbcore tveeprom ppdev 
shpchp snd_hwdep pci_hotplug emu10k1_gp snd_rawmidi snd_seq_device 8139too rtc_cmos via_agp rtc_core snd rtc_lib ns558 gameport sou
ndcore parport_pc parport xfs exportfs ufs agpgart tuner v4l2_common videodev v4l1_compat w83627hf hwmon_vid hwmon i2c_viapro i2c_d
ev                                                                                                                                 
May 24 09:19:39 (none) kernel:                                                                                                     
May 24 09:19:39 (none) kernel: Pid: 199, comm: kswapd0 Not tainted (2.6.29.4-i486 #3) MS-6380E                                     
May 24 09:19:39 (none) kernel: EIP: 0060:[<c10ff322>] EFLAGS: 00010202 CPU: 0                                                      
May 24 09:19:39 (none) kernel: EIP is at radix_tree_tag_set+0x1d/0xa8                                                              
May 24 09:19:39 (none) kernel: EAX: cf083a58 EBX: c6e70280 ECX: 00000000 EDX: 01e1dd64                                             
May 24 09:19:39 (none) kernel: ESI: cf2d8c00 EDI: 00000000 EBP: cfa86e80 ESP: cfa86e60                                             
May 24 09:19:39 (none) kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068                                                       
May 24 09:19:39 (none) kernel: Process kswapd0 (pid: 199, ti=cfa86000 task=cfa94000 task.ti=cfa86000)                              
May 24 09:19:39 (none) kernel: Stack:                                                                                              
May 24 09:19:39 (none) kernel:  00000000 01e1dd64 cf083a58 cf083bec ce33ef84 c6e70280 cf2d8c00 cf083a20                            
May 24 09:19:39 (none) kernel:  cfa86e9c d0abbc48 c6e70324 cf083a54 c6e703a4 c6e70280 00000045 cfa86eac                            
May 24 09:19:39 (none) kernel:  d0aaf508 c6e703a4 c6e703ac cfa86eb8 d0aba743 c6e703a4 cfa86ec4 c10961ec                            
May 24 09:19:39 (none) kernel: Call Trace:                                                                                         
May 24 09:19:39 (none) kernel:  [<d0abbc48>] ? xfs_inode_set_reclaim_tag+0x69/0x88 [xfs]                                           
May 24 09:19:39 (none) kernel:  [<d0aaf508>] ? xfs_reclaim+0x88/0x8e [xfs]                                                         
May 24 09:19:39 (none) kernel:  [<d0aba743>] ? xfs_fs_destroy_inode+0x2a/0x43 [xfs]                                                
May 24 09:19:39 (none) kernel:  [<c10961ec>] ? destroy_inode+0x27/0x38                                                             
May 24 09:19:39 (none) kernel:  [<c10962ad>] ? dispose_list+0xb0/0xd9                                                              
May 24 09:19:39 (none) kernel:  [<c1096486>] ? shrink_icache_memory+0x1b0/0x1dd                                                    
May 24 09:19:39 (none) kernel:  [<c106c673>] ? shrink_slab+0xd8/0x132                                                              
May 24 09:19:39 (none) kernel:  [<c106cac3>] ? kswapd+0x3f6/0x56e                                                                  
May 24 09:19:39 (none) kernel:  [<c106b1e4>] ? isolate_pages_global+0x0/0x1b0                                                      
May 24 09:19:39 (none) kernel:  [<c1034437>] ? autoremove_wake_function+0x0/0x33                                                   
May 24 09:19:39 (none) kernel:  [<c101b1e6>] ? complete+0x34/0x3e                                                                  
May 24 09:19:39 (none) kernel:  [<c106c6cd>] ? kswapd+0x0/0x56e                                                                    
May 24 09:19:39 (none) kernel:  [<c103415c>] ? kthread+0x3b/0x61                                                                   
May 24 09:19:39 (none) kernel:  [<c1034121>] ? kthread+0x0/0x61                                                                    
May 24 09:19:39 (none) kernel:  [<c10037c7>] ? kernel_thread_helper+0x7/0x10                                                       
May 24 09:19:39 (none) kernel: Code: b4 89 42 04 83 c4 4c 89 f0 5b 5e 5f 5d c3 55 89 e5 57 56 53 83 ec 14 89 45 e8 89 55 e4 89 4d e
0 8b 38 3b 14 bd 6c ee 39 c1 76 04 <0f> 0b eb fe 8b 45 e8 8b 55 e0 8b 70 08 6b c7 06 c1 e2 03 89 55                                
May 24 09:19:39 (none) kernel: EIP: [<c10ff322>] radix_tree_tag_set+0x1d/0xa8 SS:ESP 0068:cfa86e60                                 
May 24 09:19:39 (none) kernel: ---[ end trace 06b8f5f516406ef0 ]---

-------
Comment 2 Rafael J. Wysocki 2009-05-25 23:15:06 UTC
On Monday 25 May 2009, Alex Samad wrote:
> Hi
> 
> As far as I know it is still a problem
Comment 3 Rafael J. Wysocki 2009-05-25 23:15:18 UTC
On Monday 25 May 2009, Mike Dresser wrote:
> Confirmed this is still a problem with 2.6.29.4
> 
> Mike
> 
> On Mon, 25 May 2009, Alex Samad wrote:
> 
> > Hi
> >
> > As far as I know it is still a problem
Comment 4 Miah Gregory 2009-05-27 11:49:52 UTC
Confirming this occurs with 2.6.29.4 as well.
Comment 5 Rafael J. Wysocki 2009-06-01 20:20:53 UTC
On Sunday 31 May 2009, Alex Samad wrote:
> Hi
> 
> as far as I know it is still a problem
> 
> On Sat, May 30, 2009 at 09:55:40PM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.28 and 2.6.29.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.28 and 2.6.29.  Please verify if it still should
> > be listed and let me know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=13375
> > Subject             : Kernel crash with 2.6.29 + nfs + xfs (radix-tree)
> > Submitter   : Alex Samad <alex@samad.com.au>
> > Date                : 2009-05-20 0:37 (11 days old)
> > References  : http://marc.info/?l=linux-kernel&m=124278675503699&w=4
Comment 6 Andrew Randrianasulu 2009-06-10 11:08:37 UTC
Using 2.6.30 for both client and server i can see another bug early - probably  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=527517


reconnect_path: npd != pd
reconnect_path: npd != pd
reconnect_path: npd != pd

in syslog on sever and 

/usr/sbin/emerge-webrsync: line 72: tar: command not found
/usr/sbin/emerge-webrsync: line 280: [: -lt: unary operator expected

on client (SGI O2, performing gentoo update)

Same bug was reported on LKML here, i think:
http://lkml.org/lkml/2009/5/21/165
Comment 7 Andrew Randrianasulu 2009-06-10 11:36:24 UTC
http://groups.google.com/group/linux.kernel/browse_thread/thread/8c5bd1f7139a560c/9836e181395ae0ab?lnk=raot

two more traces
-----------

oops.  I've overseen a BUG. this is on 2.6.30-rc8-git2 (with NFS still active):

Jun  7 00:04:29 [kernel] [18059.860915] reconnect_path: npd != pd
Jun  7 00:04:29 [kernel] [18059.876749] reconnect_path: npd != pd
Jun  7 00:04:29 [kernel] [18059.884702] reconnect_path: npd != pd
Jun  7 00:04:29 [kernel] [18060.605674] kernel BUG at lib/radix-tree.c:485!
Jun  7 00:04:29 [kernel] [18060.605689] CPU 1
Jun  7 00:04:29 [kernel] [18060.605693] Modules linked in: usbtouchscreen
dvb_usb_cinergyT2 dummy bonding snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus
forcedeth snd_pcm hfcpci snd_page_alloc snd_util_mem snd_hwdep
Jun  7 00:04:29 [kernel] [18060.605721] Pid: 392, comm: kswapd0 Not tainted
2.6.30-rc8-git2 #2 empty
Jun  7 00:04:29 [kernel] [18060.605726] RIP: 0010:[<ffffffff80451ad8>]  
[<ffffffff80451ad8>] radix_tree_tag_set+0x98/0xc0
Jun  7 00:04:29 [kernel] [18060.605743] RSP: 0018:ffff880226b05cd0  EFLAGS:
00010246
Jun  7 00:04:29 [kernel] [18060.605748] RAX: 000000000000000c RBX:
0000000000000000 RCX: 000000000000000c
Jun  7 00:04:29 [kernel] [18060.605752] RDX: 0000000000000000 RSI:
000000000000040c RDI: ffff88022628e360
Jun  7 00:04:29 [kernel] [18060.605757] RBP: ffff880201159c00 R08:
ffff88020119da28 R09: 0000000000000000
Jun  7 00:04:29 [kernel] [18060.605762] R10: 0000000000000000 R11:
0000000000000001 R12: ffff880201159c00
Jun  7 00:04:29 [kernel] [18060.605766] R13: ffff88022505f400 R14:
ffff880201159cf8 R15: ffff88022628e35c
Jun  7 00:04:29 [kernel] [18060.605772] FS:  0000000043d51950(0000)
GS:ffff88002804e000(0000) knlGS:00000000f4ceab90
Jun  7 00:04:29 [kernel] [18060.605777] CS:  0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Jun  7 00:04:29 [kernel] [18060.605782] CR2: 000000000044b7c0 CR3:
00000001e45fe000 CR4: 00000000000006e0
Jun  7 00:04:29 [kernel] [18060.605786] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Jun  7 00:04:29 [kernel] [18060.605791] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Jun  7 00:04:29 [kernel] [18060.605796] Process kswapd0 (pid: 392, threadinfo
ffff880226b04000, task ffff880227981620)
Jun  7 00:04:29 [kernel] [18060.605803]  ffff88022628e320 ffffffff8040a1d6
ffff880201159d80 0000000000000071
Jun  7 00:04:29 [kernel] [18060.606011] RIP  [<ffffffff80451ad8>]
radix_tree_tag_set+0x98/0xc0
Jun  7 00:04:30 [kernel] [18060.606018]  RSP <ffff880226b05cd0>
Jun  7 00:04:30 [kernel] [18060.606026] ---[ end trace 0645e929a4fa40ac ]---
Jun  7 00:04:31 [kernel] [18061.930702] reconnect_path: npd != pd
Jun  7 00:04:31 [kernel] [18061.930944] reconnect_path: npd != pd
Jun  7 00:04:32 [kernel] [18063.327856] reconnect_path: npd != pd
Jun  7 00:04:32 [kernel] [18063.328465] reconnect_path: npd != pd
Jun  7 00:04:32 [kernel] [18063.328722] reconnect_path: npd != pd
Jun  7 00:04:32 [kernel] [18063.329075] reconnect_path: npd != pd
Jun  7 00:04:32 [kernel] [18063.329598] reconnect_path: npd != pd
Jun  7 00:04:32 [kernel] [18063.329734] reconnect_path: npd != pd
Jun  7 00:04:32 [kernel] [18063.330410] reconnect_path: npd != pd
Jun  7 00:04:32 [kernel] [18063.330538] reconnect_path: npd != pd 

--------

and 

-------

Mine crashed last night, nothing was logged in the local logfiles, but
fortunately remote syslog got it

Jun  9 01:24:07 x kernel: ------------[ cut here ]------------
Jun  9 01:24:07 x kernel: kernel BUG at lib/radix-tree.c:485!
Jun  9 01:24:07 x kernel: invalid opcode: 0000 [#1] SMP
Jun  9 01:24:07 x kernel: last sysfs file: /sys/class/scsi_host/host0/stats
Jun  9 01:24:07 x kernel: CPU 0
Jun  9 01:24:07 x kernel: Pid: 338, comm: kswapd0 Not tainted 2.6.30-rc8 #2 S2895
Jun  9 01:24:07 x kernel: RIP: 0010:[<ffffffff803a5a04>]  [<ffffffff803a5a04>] radix_tree_tag_set+0x6b/0x9c
Jun  9 01:24:07 x kernel: RSP: 0018:ffff88016e1e9c58  EFLAGS: 00010246
Jun  9 01:24:07 x kernel: RAX: 0000000000000038 RBX: 0000000000000000 RCX: 0000000000000038
Jun  9 01:24:07 x kernel: RDX: 0000000000000000 RSI: 00000000002faaf8 RDI: ffff88016c3b8220
Jun  9 01:24:07 x kernel: RBP: ffff88016e1e9c60 R08: 0000000000000000 R09: ffff8800927460b8
Jun  9 01:24:07 x kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff8800666e61c0
Jun  9 01:24:07 x kernel: R13: ffff88016dc21c00 R14: ffff8800666e62c8 R15: ffff88016c3b821c
Jun  9 01:24:07 x kernel: FS:  00007fda3776f6e0(0000) GS:ffff880028028000(0000) knlGS:0000000000000000
Jun  9 01:24:07 x kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jun  9 01:24:07 x kernel: CR2: 00007fda368758e0 CR3: 0000000000201000 CR4: 00000000000006e0
Jun  9 01:24:07 x kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun  9 01:24:07 x kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun  9 01:24:07 x kernel: Process kswapd0 (pid: 338, threadinfo ffff88016e1e8000, task ffff88016f245fa0)
Jun  9 01:24:07 x kernel: Stack:
Jun  9 01:24:07 x kernel:  ffff88016c3b81e0 ffff88016e1e9ca0 ffffffff8038dcd6 ffff88016e1e9d40
Jun  9 01:24:07 x kernel:  ffff8800666e6350 ffff8800666e61c0 0000000000000048 ffff88016e1e9d40
Jun  9 01:24:07 x kernel:  0000000000000080 ffff88016e1e9cc0 ffffffff8037f2d2 ffff8800666e6350
Jun  9 01:24:07 x kernel: Call Trace:
Jun  9 01:24:07 x kernel:  [<ffffffff8038dcd6>] xfs_inode_set_reclaim_tag+0x71/0x93
Jun  9 01:24:07 x kernel:  [<ffffffff8037f2d2>] xfs_reclaim+0x106/0x10d
Jun  9 01:24:07 x kernel:  [<ffffffff8038c51e>] xfs_fs_destroy_inode+0x37/0x58
Jun  9 01:24:07 x kernel:  [<ffffffff8029dde0>] destroy_inode+0x32/0x47
Jun  9 01:24:07 x kernel:  [<ffffffff8029dec9>] dispose_list+0xd4/0x102
Jun  9 01:24:07 x kernel:  [<ffffffff8029e0f0>] shrink_icache_memory+0x1f9/0x22f
Jun  9 01:24:07 x kernel:  [<ffffffff80269660>] shrink_slab+0xdf/0x154
Jun  9 01:24:07 x kernel:  [<ffffffff80269e13>] kswapd+0x48d/0x62c
Jun  9 01:24:07 x kernel:  [<ffffffff80267765>] ? isolate_pages_global+0x0/0x219
Jun  9 01:24:07 x kernel:  [<ffffffff802481b8>] ? autoremove_wake_function+0x0/0x38
Jun  9 01:24:07 x kernel:  [<ffffffff80269986>] ? kswapd+0x0/0x62c
Jun  9 01:24:07 x kernel:  [<ffffffff80269986>] ? kswapd+0x0/0x62c
Jun  9 01:24:07 x kernel:  [<ffffffff80247e1a>] kthread+0x56/0x83
Jun  9 01:24:07 x kernel:  [<ffffffff8020c9ba>] child_rip+0xa/0x20
Jun  9 01:24:07 x kernel:  [<ffffffff80247dc4>] ? kthread+0x0/0x83
Jun  9 01:24:07 x kernel:  [<ffffffff8020c9b0>] ? child_rip+0x0/0x20
Jun  9 01:24:07 x kernel: Code: 18 02 00 00 48 d3 e8 89 c1 83 e1 3f 41 0f a3 0c 11 19 c0 85 c0 75 07 49 8d 04 11 0f ab 08 48 63 c1 4d 8b 44 c0 18 4d 85 c0 75 04 <0f> 0b eb fe 41 83 eb 06 41 ff ca 45$
Jun  9 01:24:07 x kernel: RIP  [<ffffffff803a5a04>] radix_tree_tag_set+0x6b/0x9c
Jun  9 01:24:07 x kernel:  RSP <ffff88016e1e9c58>
Jun  9 01:24:07 x kernel: ---[ end trace a0564fe308c3b2b4 ]---

CONFIG_XFS_DEBUG was on for this one.

I've noticed it's always kswapd0 that dies?

Mike
-------

Sorry for messing up this bug entry, it was reported to linux-kernel newsgroup, but unnoticed by me.
Comment 8 Stephane 2009-06-12 16:32:35 UTC
can this fix the bug? fs/nfs/write.c gets the lock on the inode before calling  radix_tree_tag_set. I'll test it since I can reproduce the issue.

--- linux-orig/fs/xfs/xfs_vnodeops.c    2009-06-12 11:45:09.841788774 -0400
+++ linux/fs/xfs/xfs_vnodeops.c 2009-06-12 11:47:11.525856430 -0400
@@ -2702,13 +2702,14 @@
         * that has a linux inode being reclaimed. Synchronisation is provided
         * by the i_flags_lock.
         */
+       xfs_ilock(ip, XFS_ILOCK_EXCL);
        if (!ip->i_update_core && (ip->i_itemp == NULL)) {
-               xfs_ilock(ip, XFS_ILOCK_EXCL);
                xfs_iflock(ip);
                xfs_iflags_set(ip, XFS_IRECLAIMABLE);
                return xfs_reclaim_inode(ip, 1, XFS_IFLUSH_DELWRI_ELSE_SYNC);
        }
        xfs_inode_set_reclaim_tag(ip);
+       xfs_iunlock(ip, XFS_ILOCK_EXCL);
        return 0;
 }
 

kernel BUG at lib/radix-tree.c:485!
Jun 11 18:23:50 sledge_a kernel: invalid opcode: 0000 [#1] SMP 
Jun 11 18:23:50 sledge_a kernel: last sysfs file: /sys/class/scsi_host/host0/proc_name
Jun 11 18:23:50 sledge_a kernel: Modules linked in: iptable_filter ip_tables x_tables ipmi_devintf ipmi_si ipmi_msghandler dell_rbu drbd nfsd lockd nfs_acl auth_rpcgss sunr
pc autofs4 i2c_dev i2c_core rdma_ucm rdma_cm iw_cm ib_addr mptctl dm_round_robin dm_multipath mptscsih mptbase scsi_transport_fc ehci_hcd uhci_hcd usbcore ib_mthca iTCO_wdt
 ib_ipoib ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core myri10ge inet_lro tg3 libphy bnx2 zlib_inflate crc32 bonding ata_piix libata dm_snapshot dm_zero dm_mirror dm_
region_hash dm_log dm_mod megaraid_sas [last unloaded: x_tables]
Jun 11 18:23:50 sledge_a kernel: 
Jun 11 18:23:50 sledge_a kernel: Pid: 283, comm: kswapd0 Not tainted (2.6.29.3-3.7.00_3030MAXT #1) PowerEdge 2900
Jun 11 18:23:50 sledge_a kernel: EIP: 0060:[<c038ad75>] EFLAGS: 00010246 CPU: 2
Jun 11 18:23:50 sledge_a kernel: EIP is at radix_tree_tag_set+0x95/0xa0
Jun 11 18:23:50 sledge_a kernel: EAX: 00000000 EBX: 00000001 ECX: ee850b90 EDX: 00000035
Jun 11 18:23:50 sledge_a kernel: ESI: 00000000 EDI: 00000000 EBP: f484fe18 ESP: f484fdfc
Jun 11 18:23:50 sledge_a kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jun 11 18:23:50 sledge_a kernel: Process kswapd0 (pid: 283, ti=f484e000 task=f503f500 task.ti=f484e000)
Jun 11 18:23:50 sledge_a kernel: Stack:
Jun 11 18:23:50 sledge_a kernel:  00000000 00000000 00000035 ef3c348c ef3c3454 e02d8f00 00000000 f484fe2c
Jun 11 18:23:50 sledge_a kernel:  c03633f8 e02d8f00 e02d9024 f484fe94 f484fe38 c03556d6 e02d9024 f484fe50
Jun 11 18:23:50 sledge_a kernel:  c03618cd 00000007 f484fe54 c023bd6f e02d9024 f484fe5c c02926f3 e02d902c
Jun 11 18:23:50 sledge_a kernel: Call Trace:
Jun 11 18:23:50 sledge_a kernel:  [<c03633f8>] ? xfs_inode_set_reclaim_tag+0x78/0xb0
Jun 11 18:23:50 sledge_a kernel:  [<c03556d6>] ? xfs_reclaim+0x46/0xa0
Jun 11 18:23:50 sledge_a kernel:  [<c03618cd>] ? xfs_fs_destroy_inode+0x2d/0x50
Jun 11 18:23:50 sledge_a kernel:  [<c023bd6f>] ? wake_up_bit+0x1f/0x30
Jun 11 18:23:50 sledge_a kernel:  [<c02926f3>] ? destroy_inode+0x23/0x40
Jun 11 18:23:50 sledge_a kernel:  [<c02929de>] ? dispose_list+0x6e/0xe0
Jun 11 18:23:50 sledge_a kernel:  [<c0292c93>] ? prune_icache+0xd3/0x1f0
Jun 11 18:23:50 sledge_a kernel:  [<c0292dc7>] ? shrink_icache_memory+0x17/0x40
Jun 11 18:23:50 sledge_a kernel:  [<c0262229>] ? shrink_slab+0x139/0x1d0
Jun 11 18:23:50 sledge_a kernel:  [<c0263fff>] ? balance_pgdat+0x24f/0x3e0
Jun 11 18:23:50 sledge_a kernel:  [<c0262d10>] ? isolate_pages_global+0x0/0x70
Jun 11 18:23:50 sledge_a kernel:  [<c0264225>] ? kswapd+0x95/0xd0
Jun 11 18:23:50 sledge_a kernel:  [<c023bc60>] ? autoremove_wake_function+0x0/0x50
Jun 11 18:23:50 sledge_a kernel:  [<c023bc60>] ? autoremove_wake_function+0x0/0x50
Jun 11 18:23:50 sledge_a kernel:  [<c0223196>] ? complete+0x46/0x60
Jun 11 18:23:50 sledge_a kernel:  [<c0264190>] ? kswapd+0x0/0xd0
Jun 11 18:23:51 sledge_a kernel:  [<c023b5d4>] ? kthread+0x64/0xa0
Jun 11 18:23:51 sledge_a kernel:  [<c023b570>] ? kthread+0x0/0xa0
Jun 11 18:23:51 sledge_a kernel:  [<c0203a0b>] ? kernel_thread_helper+0x7/0x1c
Jun 11 18:23:51 sledge_a kernel: Code: cf 85 ff 74 1c 8b 4d e8 ba 01 00 00 00 83 c1 15 d3 e2 8b 4d f0 8b 41 04 85 c2 75 05 09 d0 89 41 04 83 c4 10 89 f8 5b 5e 5f 5d c3 <0f>
 0b eb fe 0f 0b eb fe 8d 76 00 55 89 e5 57 89 d7 56 31 f6 53 
Jun 11 18:23:51 sledge_a kernel: EIP: [<c038ad75>] radix_tree_tag_set+0x95/0xa0 SS:ESP 0068:f484fdfc
Jun 11 18:23:51 sledge_a kernel: ---[ end trace 9be37bfd1c72c797 ]---
Comment 9 Stephane 2009-06-12 18:39:14 UTC
Created attachment 21884 [details]
Potential fix in xfs_vnodeops.c

Just looks interesting that you would look into the ip for info without the lock.
The values of ip->i_update_core or ip->i_itemp could change between the if () and the xfs_ilock.

I'm able to reproduced the bug in house so I'll test the change for a few more days, 1 hours is not meaningful.

Sorry if this comes has noise and I'm way off.
Comment 10 Stephane 2009-06-13 01:49:40 UTC
not good.
Comment 11 Stephane 2009-06-15 16:08:56 UTC
in xfs_inode_set_reclaim_tag, I addedd a check:

        if (radix_tree_lookup(&pag->pag_ici_root, ag))
                radix_tree_tag_set(&pag->pag_ici_root, ag, XFS_ICI_RECLAIM_TAG);
        else
                xfs_fs_cmn_err(CE_WARN, mp, "radix tree error on ag %d", ag);

my file system has 488 AG, I got this on the console:

Filesystem "drbd0": radix tree error on ag 636048
Filesystem "drbd0": radix tree error on ag 636048
Filesystem "drbd0": radix tree error on ag 2052
Filesystem "drbd0": radix tree error on ag 2052

Also I changed my patch to unlock before the set_reclaim call.

--- linux-orig/fs/xfs/xfs_vnodeops.c    2009-06-12 11:45:09.841788774 -0400
+++ linux/fs/xfs/xfs_vnodeops.c 2009-06-12 11:47:11.525856430 -0400
@@ -2702,13 +2702,14 @@
         * that has a linux inode being reclaimed. Synchronisation is provided
         * by the i_flags_lock.
         */
+       xfs_ilock(ip, XFS_ILOCK_EXCL);
        if (!ip->i_update_core && (ip->i_itemp == NULL)) {
-               xfs_ilock(ip, XFS_ILOCK_EXCL);
                xfs_iflock(ip);
                xfs_iflags_set(ip, XFS_IRECLAIMABLE);
                return xfs_reclaim_inode(ip, 1, XFS_IFLUSH_DELWRI_ELSE_SYNC);
        }
+       xfs_iunlock(ip, XFS_ILOCK_EXCL);
        xfs_inode_set_reclaim_tag(ip);
        return 0;
 }
Comment 12 Andrew Randrianasulu 2009-06-16 20:25:40 UTC
No, patch from #11 doesn't help me:

Jun 16 21:11:14 (none) kernel: reconnect_path: npd != pd
Jun 16 21:11:39 (none) last message repeated 484 times
Jun 16 22:41:33 (none) kernel: reconnect_path: npd != pd
Jun 16 22:47:52 (none) kernel: reconnect_path: npd != pd
Jun 16 22:48:35 (none) last message repeated 25 times
Jun 16 22:50:44 (none) kernel: reconnect_path: npd != pd
Jun 16 22:50:45 (none) last message repeated 13 times
Jun 17 00:05:36 (none) kernel: reconnect_path: npd != pd
Jun 17 00:08:46 (none) kernel: ------------[ cut here ]------------
Jun 17 00:08:46 (none) kernel: kernel BUG at lib/radix-tree.c:485!
Jun 17 00:08:46 (none) kernel: invalid opcode: 0000 [#1] SMP 
Jun 17 00:08:46 (none) kernel: last sysfs file: /sys/devices/platform/w83627hf.656/in8_input
Jun 17 00:08:46 (none) kernel: Modules linked in: sg sd_mod snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc tuner_simple tuner_types tda9887 tda8290 usb_storage scsi_mod libusual snd_cs4232 snd_emu10k1 snd_mpu401 saa7134 snd_via82xx snd_ac97_codec snd_wavefront snd_wss_lib uhci_hcd ac97_bus ir_common ehci_hcd snd_pcm snd_opl3_lib ppdev videobuf_dma_sg snd_util_mem snd_timer snd_page_alloc videobuf_core snd_mpu401_uart tveeprom usbcore snd_hwdep 8139too shpchp snd_rawmidi emu10k1_gp pci_hotplug snd_seq_device rtc_cmos via_agp rtc_core snd rtc_lib ns558 gameport soundcore parport_pc parport xfs exportfs ufs agpgart tuner v4l2_common videodev v4l1_compat w83627hf hwmon_vid hwmon i2c_viapro i2c_dev
Jun 17 00:08:46 (none) kernel: 
Jun 17 00:08:46 (none) kernel: Pid: 197, comm: kswapd0 Not tainted (2.6.29.4-i486 #4) MS-6380E
Jun 17 00:08:46 (none) kernel: EIP: 0060:[<c10ff373>] EFLAGS: 00010246 CPU: 0
Jun 17 00:08:46 (none) kernel: EIP is at radix_tree_tag_set+0x6e/0xa8
Jun 17 00:08:46 (none) kernel: EAX: cf52ea50 EBX: 0000001d ECX: 0000001d EDX: 00000000
Jun 17 00:08:46 (none) kernel: ESI: 00000000 EDI: 00000002 EBP: cfa51e80 ESP: cfa51e60
Jun 17 00:08:46 (none) kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jun 17 00:08:46 (none) kernel: Process kswapd0 (pid: 197, ti=cfa51000 task=cfa5aba0 task.ti=cfa51000)
Jun 17 00:08:47 (none) kernel: Stack:
Jun 17 00:08:47 (none) kernel:  00000000 0078777e cf047a9c 00000006 00000000 ced05900 cf1e9400 cf047a64
Jun 17 00:08:47 (none) kernel:  cfa51e9c d1411470 ced059ac cf047a98 ced05a2c ced05900 00000004 cfa51eac
Jun 17 00:08:47 (none) kernel:  d1403af0 ced05a2c ced05a34 cfa51eb8 d140fea1 ced05a2c cfa51ec4 c10961ec
Jun 17 00:08:47 (none) kernel: Call Trace:
Jun 17 00:08:47 (none) kernel:  [<d1411470>] ? xfs_inode_set_reclaim_tag+0x69/0x88 [xfs]
Jun 17 00:08:47 (none) kernel:  [<d1403af0>] ? xfs_reclaim+0xe7/0xed [xfs]
Jun 17 00:08:47 (none) kernel:  [<d140fea1>] ? xfs_fs_destroy_inode+0x2a/0x43 [xfs]
Jun 17 00:08:47 (none) kernel:  [<c10961ec>] ? destroy_inode+0x27/0x38
Jun 17 00:08:47 (none) kernel:  [<c10962ad>] ? dispose_list+0xb0/0xd9
Jun 17 00:08:47 (none) kernel:  [<c1096486>] ? shrink_icache_memory+0x1b0/0x1dd
Jun 17 00:08:47 (none) kernel:  [<c106c673>] ? shrink_slab+0xd8/0x132
Jun 17 00:08:47 (none) kernel:  [<c106cac3>] ? kswapd+0x3f6/0x56e
Jun 17 00:08:47 (none) kernel:  [<c106b1e4>] ? isolate_pages_global+0x0/0x1b0
Jun 17 00:08:47 (none) kernel:  [<c1034437>] ? autoremove_wake_function+0x0/0x33
Jun 17 00:08:47 (none) kernel:  [<c101b1e6>] ? complete+0x34/0x3e
Jun 17 00:08:47 (none) kernel:  [<c106c6cd>] ? kswapd+0x0/0x56e
Jun 17 00:08:47 (none) kernel:  [<c103415c>] ? kthread+0x3b/0x61
Jun 17 00:08:47 (none) kernel:  [<c1034121>] ? kthread+0x0/0x61
Jun 17 00:08:47 (none) kernel:  [<c10037c7>] ? kernel_thread_helper+0x7/0x10
Jun 17 00:08:47 (none) kernel: Code: 8b 5d e4 8b 55 e0 d3 eb 83 e3 3f 89 d9 e8 5f fc ff ff 85 c0 75 0d 8b 55 f0 8d 84 16 10 01 00 00 0f ab 18 8b 74 9e 10 85 f6 75 04 <0f> 0b eb fe 83 6d ec 06 4f 85 ff 75 c3 85 f6 74 1f 8b 4d e8 b8 
Jun 17 00:08:47 (none) kernel: EIP: [<c10ff373>] radix_tree_tag_set+0x6e/0xa8 SS:ESP 0068:cfa51e60
Jun 17 00:08:47 (none) kernel: ---[ end trace 7f1d5c5d799fd802 ]---
Comment 13 Tommy 2009-06-17 08:39:01 UTC
Created attachment 21962 [details]
Screenshot of same problem

Hello we're experiencing the same problem. Sorry i didn't have text based logging but i could capture a screenshot. I'll attach it so it may aid in further debugging.
Comment 14 Stephane 2009-06-17 16:31:54 UTC
don't forget to run xfs_repair. But I agree, the patch just buy you some time, there is still an issue.  2.6.28.10 seams to be the last version where XFS is usable for file sharing over NFS.
Comment 15 Stephane 2009-06-17 16:46:06 UTC
(In reply to comment #13)
> Created an attachment (id=21962) [details]
> Screenshot of same problem
> 
> Hello we're experiencing the same problem. Sorry i didn't have text based
> logging but i could capture a screenshot. I'll attach it so it may aid in
> further debugging.

Different problem, this bug is related to radix-tree being passed an invalid id from xfs in the set_reclaim_flag.
Comment 16 Tommy 2009-06-17 17:16:05 UTC
(In reply to comment #15)
> Different problem, this bug is related to radix-tree being passed an invalid
> id
> from xfs in the set_reclaim_flag.

I wouldn't be so sure it's a different problem. As all traces above, including mine, all crash after xfs_fs_destroy_inode.
Comment 17 Gabriel Barazer 2009-06-25 08:21:30 UTC
Hi,

Same bug here on 2.6.29.4 on the server, various versions on the clients (2.6.27 / 2.6.29). I tried to disable the swap space to not use kswapd with no success and the same bug. I also have my syslog filled with "reconnect_path: npd != pd" from the kernel log and stale NFS file handles one the clients (I suspect there is 1 line each time the client tries to access but get a stale NFS file handle error), but I'm not sure this is related. Here are my backtraces, all related to lib/radix-tree.c:473 AND 485:


   [43251.547520] ------------[ cut here ]------------
   [43251.547714] kernel BUG at lib/radix-tree.c:485!
   [43251.547902] invalid opcode: 0000 [#1] SMP
   [43251.548151] last sysfs file: /sys/devices/virtual/block/dm-11/dm/name
   [43251.548345] CPU 7
   [43251.548564] Pid: 52, comm: kswapd0 Not tainted 2.6.29.4-filer #3 X7DCX
   [43251.548758] RIP: 0010:[<ffffffff811e5fca>] [<ffffffff811e5fca>] radix_tree_tag_set+0x88/0xc8
   [43251.549144] RSP: 0018:ffff88031d3d9c78 EFLAGS: 00010246
   [43251.549336] RAX: ffff8802639eff58 RBX: 0000000000000000 RCX: 000000000000001d
   [43251.549532] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8802639efd40
   [43251.549727] RBP: 0000000000000001 R08: ffff8801edd2c8c0 R09: ffff8801edd2c798
   [43251.549751] R10: ffff88031cd86800 R11: ffffffff811bb75b R12: ffff88031cdd7040
   [43251.549751] R13: 0000000000000000 R14: 000000000053159d R15: 0000000000000000
   [43251.549751] FS: 0000000000000000(0000) GS:ffff88031f849200(0000) knlGS:0000000000000000
   [43251.549751] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
   [43251.549751] CR2: 00007fcb6fc1d2d5 CR3: 00000002d0dd9000 CR4: 00000000000406e0
   [43251.549751] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [43251.549751] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
   [43251.549751] Process kswapd0 (pid: 52, threadinfo ffff88031d3d8000, task ffff88031fa44fa0)
   [43251.549751] Stack:
   [43251.549751] ffff88030000001d 0000000000000000 ffff8801edd2c740 ffff88031cdd7000
   [43251.549751] ffff88031cdd703c ffff8801edd2c838 ffff88031cd86800 ffff88031d3d9d50
   [43251.552255] ffffffff811bcecb ffff8801edd2c740 ffff8801edd2c8d0 000000000000007b
   [43251.553006] Call Trace:
   [43251.553006] [<ffffffff811bcecb>] ? xfs_inode_set_reclaim_tag+0x65/0x80
   [43251.553006] [<ffffffff811af041>] ? xfs_reclaim+0x99/0x9f
   [43251.553006] [<ffffffff811bb791>] ? xfs_fs_destroy_inode+0x36/0x54
   [43251.553006] [<ffffffff81091aa3>] ? dispose_list+0xc1/0xf0
   [43251.553006] [<ffffffff81091cc6>] ? shrink_icache_memory+0x1f4/0x226
   [43251.553006] [<ffffffff81063f7a>] ? shrink_slab+0xde/0x154
   [43251.553006] [<ffffffff81064433>] ? kswapd+0x443/0x59d
   [43251.553006] [<ffffffff810621d9>] ? isolate_pages_global+0x0/0x25b
   [43251.553006] [<ffffffff81043a1f>] ? autoremove_wake_function+0x0/0x2a
   [43251.553006] [<ffffffff81063ff0>] ? kswapd+0x0/0x59d
   [43251.553006] [<ffffffff8104370b>] ? kthread+0x47/0x71
   [43251.553006] [<ffffffff8100ceba>] ? child_rip+0xa/0x20
   [43251.556254] [<ffffffff810436c4>] ? kthread+0x0/0x71
   [43251.556254] [<ffffffff8100ceb0>] ? child_rip+0x0/0x20
   [43251.556254] Code: 0c 24 e8 de fb ff ff 8b 0c 24 85 c0 75 10 48 8b 54 24 08 48 8d 84 13 18 02 00 00 0f ab 08 48 63 c9 48 8b 5c cb 18 48 85 db 75 04 <0f> 0b eb fe 41 83 ef 06 ff cd 85 ed 75 b0 48 85 db 74 1b 41 8b
   [43251.557004] RIP [<ffffffff811e5fca>] radix_tree_tag_set+0x88/0xc8
   [43251.557004] RSP <ffff88031d3d9c78>
   [43251.560005] ---[ end trace 7d232a2ba1ee9003 ]---



   [504384.225844] ------------[ cut here ]------------
   [504384.226057] kernel BUG at lib/radix-tree.c:473!
   [504384.226249] invalid opcode: 0000 [#1] SMP
   [504384.226504] last sysfs file: /sys/devices/virtual/block/dm-4/dm/name
   [504384.226697] CPU 2
   [504384.226911] Pid: 52, comm: kswapd0 Not tainted 2.6.29.4-filer #3 X7DCX
   [504384.227103] RIP: 0010:[<ffffffff811e5f67>] [<ffffffff811e5f67>] radix_tree_tag_set+0x25/0xc8
   [504384.227487] RSP: 0018:ffff88031d3d9c78 EFLAGS: 00010206
   [504384.227678] RAX: 0000000000000000 RBX: ffff88031bd0e180 RCX: 000000000000001a
   [504384.228019] RDX: 0000000000000000 RSI: 0000000000000100 RDI: ffff88031faf7290
   [504384.228359] RBP: 0000000000000000 R08: ffff88031bd0e300 R09: 0000000000000052
   [504384.228698] R10: ffff88031bd0ddc0 R11: ffffffff811bb75b R12: ffff88031faf7290
   [504384.229036] R13: 0000000000000000 R14: 0000000000000100 R15: ffff88031d3d9d50
   [504384.229175] FS: 0000000000000000(0000) GS:ffff88031f848300(0000) knlGS:0000000000000000
   [504384.229175] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
   [504384.229175] CR2: 00007f2e4ca912d5 CR3: 000000029ae62000 CR4: 00000000000406e0
   [504384.229175] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [504384.229175] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
   [504384.229175] Process kswapd0 (pid: 52, threadinfo ffff88031d3d8000, task ffff88031fa44fa0)
   [504384.229175] Stack:
   [504384.229175] ffff88031cc55800 0000000000000202 ffff88031bd0e180 ffff88031faf7250
   [504384.229175] ffff88031faf728c ffff88031bd0e278 ffff88031c9d5000 ffff88031d3d9d50
   [504384.229175] ffffffff811bcecb ffff88031bd0e180 ffff88031bd0e310 0000000000000053
   [504384.232254] Call Trace:
   [504384.232254] [<ffffffff811bcecb>] xfs_inode_set_reclaim_tag+0x65/0x80
   [504384.232254] [<ffffffff811af041>] xfs_reclaim+0x99/0x9f
   [504384.232254] [<ffffffff811bb791>] xfs_fs_destroy_inode+0x36/0x54
   [504384.232254] [<ffffffff81091aa3>] dispose_list+0xc1/0xf0
   [504384.232254] [<ffffffff81091cc6>] shrink_icache_memory+0x1f4/0x226
   [504384.232254] [<ffffffff81063f7a>] shrink_slab+0xde/0x154
   [504384.232254] [<ffffffff81064433>] kswapd+0x443/0x59d
   [504384.232254] [<ffffffff810621d9>] isolate_pages_global+0x0/0x25b
   [504384.232254] [<ffffffff81043a1f>] autoremove_wake_function+0x0/0x2a
   [504384.232254] [<ffffffff81063ff0>] kswapd+0x0/0x59d
   [504384.232254] [<ffffffff8104370b>] kthread+0x47/0x71
   [504384.232254] [<ffffffff8100ceba>] child_rip+0xa/0x20
   [504384.232254] [<ffffffff810436c4>] kthread+0x0/0x71
   [504384.232254] [<ffffffff8100ceb0>] child_rip+0x0/0x20
   [504384.232254] Code: 5d 41 5e 41 5f c3 41 57 41 56 49 89 f6 41 55 41 89 d5 41 54 49 89 fc 55 53 48 83 ec 10 8b 2f 89 e8 48 3b 34 c5 e0 7f 62 81 76 04 <0f> 0b eb fe 48 8b 5f 08 44 6b fd 06 89 d0 48 83 e3 fe 48 c1 e0
   [504384.236253] RIP [<ffffffff811e5f67>] radix_tree_tag_set+0x25/0xc8
   [504384.236253] RSP <ffff88031d3d9c78>
   [504384.239401] ---[ end trace 6f17673e48e005b5 ]---


   [ 7256.039741] ------------[ cut here ]------------
   [ 7256.039948] kernel BUG at lib/radix-tree.c:485!
   [ 7256.040136] invalid opcode: 0000 [#1] SMP
   [ 7256.040388] last sysfs file: /sys/devices/pci0000:00/0000:00:1a.7/usb1/1-2/1-2:1.0/host3/target3:0:0/3:0:0:0/block/sdf/uevent
   [ 7256.040739] CPU 3
   [ 7256.040959] Pid: 52, comm: kswapd0 Not tainted 2.6.29.4-filer #3 X7DCX
   [ 7256.041155] RIP: 0010:[<ffffffff811e5fca>] [<ffffffff811e5fca>] radix_tree_tag_set+0x88/0xc8
   [ 7256.041540] RSP: 0018:ffff88031d3d9c78 EFLAGS: 00010246
   [ 7256.041727] RAX: ffff880243c6c448 RBX: 0000000000000000 RCX: 000000000000002d
   [ 7256.041918] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880243c6c230
   [ 7256.042110] RBP: 0000000000000001 R08: ffff8802a0886e40 R09: ffff8802d087b680
   [ 7256.042302] R10: ffff88031cdc0400 R11: ffffffff811bb75b R12: ffff88031faf7640
   [ 7256.042495] R13: 0000000000000000 R14: 00000000001423ed R15: 0000000000000000
   [ 7256.042688] FS: 0000000000000000(0000) GS:ffff88031f848600(0000) knlGS:0000000000000000
   [ 7256.043028] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
   [ 7256.043219] CR2: 0000000001934fe8 CR3: 00000001b9127000 CR4: 00000000000406e0
   [ 7256.043386] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [ 7256.043386] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
   [ 7256.043386] Process kswapd0 (pid: 52, threadinfo ffff88031d3d8000, task ffff88031fa44fa0)
   [ 7256.043386] Stack:
   [ 7256.043386] ffff88030000002d 0000000000000000 ffff8802a0886cc0 ffff88031faf7600
   [ 7256.043386] ffff88031faf763c ffff8802a0886db8 ffff88031cdc0400 ffff88031d3d9d50
   [ 7256.043386] ffffffff811bcecb ffff8802a0886cc0 ffff8802a0886e50 0000000000000005
   [ 7256.043386] Call Trace:
   [ 7256.043386] [<ffffffff811bcecb>] ? xfs_inode_set_reclaim_tag+0x65/0x80
   [ 7256.043386] [<ffffffff811af041>] ? xfs_reclaim+0x99/0x9f
   [ 7256.043386] [<ffffffff811bb791>] ? xfs_fs_destroy_inode+0x36/0x54
   [ 7256.043386] [<ffffffff81091aa3>] ? dispose_list+0xc1/0xf0
   [ 7256.043386] [<ffffffff81091cc6>] ? shrink_icache_memory+0x1f4/0x226
   [ 7256.043386] [<ffffffff81063f7a>] ? shrink_slab+0xde/0x154
   [ 7256.043386] [<ffffffff81064433>] ? kswapd+0x443/0x59d
   [ 7256.043386] [<ffffffff810621d9>] ? isolate_pages_global+0x0/0x25b
   [ 7256.043386] [<ffffffff81043a1f>] ? autoremove_wake_function+0x0/0x2a
   [ 7256.043386] [<ffffffff81063ff0>] ? kswapd+0x0/0x59d
   [ 7256.043386] [<ffffffff8104370b>] ? kthread+0x47/0x71
   [ 7256.043386] [<ffffffff8100ceba>] ? child_rip+0xa/0x20
   [ 7256.043386] [<ffffffff810436c4>] ? kthread+0x0/0x71
   [ 7256.043386] [<ffffffff8100ceb0>] ? child_rip+0x0/0x20
   [ 7256.043386] Code: 0c 24 e8 de fb ff ff 8b 0c 24 85 c0 75 10 48 8b 54 24 08 48 8d 84 13 18 02 00 00 0f ab 08 48 63 c9 48 8b 5c cb 18 48 85 db 75 04 <0f> 0b eb fe 41 83 ef 06 ff cd 85 ed 75 b0 48 85 db 74 1b 41 8b
   [ 7256.043386] RIP [<ffffffff811e5fca>] radix_tree_tag_set+0x88/0xc8
   [ 7256.043386] RSP <ffff88031d3d9c78>
   [ 7256.052293] ---[ end trace c542ca92b3353b53 ]---

Note: I switched to 2.6.30 tonight and didn't get any "kernel BUG" message since (finger crossed), but the "reconnect_path: npd != pd" bug, forced me to reboot the server to finally mount the NFS exports on the client without getting a stale NFS file handle error at mount time.
Comment 18 Leandro Lucarella 2009-06-27 19:44:06 UTC
I'm having a lot of "reconnect_path: npd != pd" too, with both 2.6.29 and 2.6.30 (2.6.28 works fine). I'm using NFS to export an XFS filesystem in the computer I'm getting that messages
Comment 19 Daniel Burr 2009-07-06 11:38:07 UTC
I was also seeing a lot of "reconnect_path: npd != pd" messages with 2.6.30.1 running on an NFS server which was exporting an XFS filesystem.  Clients were unable to mount their nfsroot.

Updating to the latest git version of XFS (http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=summary) as of 5th July (ab8b9baac3f48fb3fc3d47790950c0eec134e678) seems to have fixed it .  I suspect that the responsible changeset was http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=ef14f0c1578dce4b688726eb2603e50b62d6665a but I have not done a bisect to verify this.
Comment 20 Daniel Burr 2009-07-12 04:05:43 UTC
After 6 days uptime on my NFS server the "reconnect_path: npd != pd" returned which means that the XFS update described in #19 does *not* fix the problem.
Comment 21 Eric Sandeen 2009-07-15 03:57:35 UTC
More info from a post on the xfs list:

            radix_tree_tag_set/xfs_inode_set_reclaim_tag crash

I keep getting kernel crashes with xfs+lvm2+mdadm (raid6) - correct in sync, 
all xfs partitions checked for corruption (but there were none, but the crashes persists).

The raid6 has just resynced now because of this kernel hang.

2.6.30 and 2.6.30.1 kernels on my nfsv3 server keep crashing, both with/without SMP, dynticks,
 selinux (although selinux for some reason seems to make it crash very often)

The machine has been memtested (memtest86+) for 14hours straight, never any stability issues.

I have around 1-3 complete kernel lockups a day with this nfs kernel server and xfs.

Tried nfs both as module and direct in kernel - both hangs the kernel
 completely (can't even use magic sysrq) when the client uses lots of small files/lots of IO.

The remote export of samba is rock stable, nfs keeps crashing with small files, 
without nfs there seems to be no crashes.

The /srv/diskless dir is a 80GB dir with lots of small files (kernels etc)

I also get a 
    "svc: failed to register lockdv1 RPC service (errno 97)."
in dmesg havn't seen that before in kernels below 2.6.30

Also lots of
    [xxxxx.yyyyy] reconnect_path: npd != pd (see ***)

And stale NFS handles on clients sometimes.

Everything was stable on server+client until the server got 2.6.30 kernel, 
the mdadm raid6 only works in 2.6.30 or above - mdadm fails to initialize it in 
anything else, so cannot downgrade (custom reshape to raid6 using echo into /sys,
 all Q blocks on one disk).

I have experimented with mount options on the client, and the client has 
been stable with these mount options before, when the server had a kernel
 below 2.6.30

Ways to reproduce:

o 2.6.30 or 2.6.31 kernel on nfs server

o xfs exports on server with /etc/exports and /etc/fstab on client as pasted below

o nfs-kernel-server either as module loaded or in kernel.

o Async on the client seems to make it more reproducible

o dd if=/nfs/largefile of=/dev/null bs=4k on the client can trigger 
    a kernel oops on the server in a few tries

o copying over a large folder with lots of files on the client 
    import from server will trigger it.

o selinux? seems to make it more unstable - i got instant kernel crash with 
    selinux options on kernel when the nfsd started - now removed but problem is still there.


I saw someone talking about kernel stack size would be the cause for this xfs+nfs problem, 
is there anything to this?

Here is the most common crash trace:
http://rlogin.dk/IMG_7155.JPG    [1]

A bugreport has already been filed, but no known solution:
http://bugzilla.kernel.org/show_bug.cgi?id=13375
http://www.google.com/search?hl=da&q=xfs+radix (lots of results but no known solution)

The below trace (at the bottom of this mail is not as common as the one in the link [1])

SERVER INFO

root@mfs:~# rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  58792  status
    100024    1   tcp  43201  status
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp  51962  nlockmgr
    100021    3   udp  51962  nlockmgr
    100021    4   udp  51962  nlockmgr
    100021    1   tcp  57205  nlockmgr
    100021    3   tcp  57205  nlockmgr
    100021    4   tcp  57205  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100005    1   udp  44137  mountd
    100005    1   tcp  46627  mountd
    100005    2   udp  44137  mountd
    100005    2   tcp  46627  mountd
    100005    3   udp  44137  mountd
    100005    3   tcp  46627  mountd

Module                  Size  Used by
xts                     2612  4 
gf128mul                7020  1 xts
nfsd                  208736  9 
lockd                  56984  1 nfsd
nfs_acl                 2384  1 nfsd
auth_rpcgss            31180  1 nfsd
sunrpc                150648  10 nfsd,lockd,nfs_acl,auth_rpcgss
uhci_hcd               17252  0 
tun                    11040  0 
sg                     22332  0 
usb_storage            45104  1 
e1000                 101476  0 
forcedeth              46244  0 
pata_amd                9100  0 
ata_generic             4184  0 
sd_mod                 21592  12 
ehci_hcd               26968  0 
usbcore               104356  4 uhci_hcd,usb_storage,ehci_hcd
xfs                   417604  12 
exportfs                3408  2 nfsd,xfs
linear                  4608  0 

/bigdaddy           *.local(rw,async,insecure,no_subtree_check,no_root_squash)

/crypt/scan         *.local(rw,async,insecure,no_subtree_check,no_root_squash)
/crypt/backup       *.local(rw,async,insecure,no_subtree_check,no_root_squash)
/crypt/pictures     *.local(rw,async,insecure,no_subtree_check,no_root_squash)

/crypt/private/music    mws*.local(rw,async,insecure,no_subtree_check,no_root_squash)
/crypt/private          mws*.local(rw,async,insecure,no_subtree_check,no_root_squash)

/bigdaddy/Music     *.local(ro,async,insecure,no_subtree_check,no_root_squash)
/torrents           *.local(rw,async,insecure,no_subtree_check,no_root_squash)


/srv/diskless/mws              *.local(rw,async,insecure,no_subtree_check,no_root_squash)
/srv/diskless/mfs              *.local(rw,async,insecure,no_subtree_check,no_root_squash)
/srv/diskless/generic          *.local(rw,async,insecure,no_subtree_check,no_root_squash)

/srv/diskless/tftp/kernels/src              *.local(rw,async,insecure,no_subtree_check,no_root_squash)

DISKLESS CLIENT
michael@mws:~% cat /etc/fstab 
cpq:/diskless/mws                       /               nfs             proto=udp                               0   0
none                                    /proc           proc            defaults                                0   0
tmpfs                                   /tmp            tmpfs           rw,size=1G                              0   0
mfs:/srv/diskless/tftp/kernels/src      /usr/src        nfs             noauto,defaults                         0   0
mfs:/srv/michael/.private/latex         /latex          nfs             proto=udp                               0   0
mfs:/crypt/private/music                /nfs/music      nfs             proto=udp                               0   0
mfs:/crypt/private                      /nfs/private    nfs             proto=udp                               0   0

mfs:/bigdaddy                           /nfs/bigdaddy   nfs             rw,user,exec,proto=udp                  0   0
mfs:/torrents                           /nfs/torrents   nfs             rw,user,exec,proto=udp                  0   0
mfs:/crypt/pictures                     /nfs/pictures   nfs             rw,user,exec,proto=udp                  0   0
mfs:/crypt/scan                         /nfs/scan       nfs             rw,user,exec,rsize=4096,wsize=4096      0   0
mfs:/crypt/backup                       /nfs/backup     nfs             rw,user,exec,proto=udp                  0   0

/usr/src/diskless_mws                   /usr/src/linux  bind            noauto,defaults,bind                    0   0
/dev/ipod                               /ipod           vfat            defaults,user,noauto,umask=000          0   0

michael@mws:~% uname -r
2.6.22.1mws_diskless

SERVER DMESG (i also have a lot of radix_tree hangs but i dont have a 
trace for them except the [1] picture, they didn't get logged,
 but they seem more common - they crash the kernel completely):

***:
this is not the newest dump, but it is one of the dumps that I have of it:
normally there is also a svc: failed to register lockdv1 RPC service (errno 97). in dmesg
and a lot of 
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
[xxxxx.yyyyy] reconnect_path: npd != pd
(easily 100 of those in dmesg on the server when the client uses lots of files or bandwidth)

the kernel oops always comes after these messages, not before and is always a 
reclaim inode bug or xfs radix tree bug (i think the bug happens when the client tries to -delete- files).
an aMule client or rtorrent on the client will trigger the oops easily - or just deleting a large folder 
or mv'ing one on the client.

[  117.895574] BUG: unable to handle kernel NULL pointer dereference at 00000004
[  117.895749] IP: [<c113d95c>] inode_has_perm+0x1e/0x62
[  117.895883] *pde = 00000000
[  117.896011] Oops: 0000 [#4] SMP
[  117.896167] last sysfs file: /sys/kernel/uevent_seqnum
[  117.896269] Modules linked in: uhci_hcd usb_storage sg sr_mod ehci_hcd cdrom forcedeth ohci_hcd usbcore raid10 raid0 pata_amd ata_generic aic7xxx scsi_transport_spi sd_mod
[  117.897007]
[  117.897097] Pid: 3799, comm: nfsd Tainted: G      D    (2.6.30 #14) System Product Name
[  117.897254] EIP: 0060:[<c113d95c>] EFLAGS: 00010246 CPU: 0
[  117.897351] EIP is at inode_has_perm+0x1e/0x62
[  117.897445] EAX: 00000000 EBX: 00000000 ECX: 00000002 EDX: f2ba0424
[  117.897543] ESI: f1b90380 EDI: f194ce80 EBP: f194ce80 ESP: f5719e2c
[  117.897640]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[  117.897737] Process nfsd (pid: 3799, ti=f5718000 task=f687e080 task.ti=f5718000)
[  117.897891] Stack:
[  117.897980]  00000002 f1aec0c0 f5bd6000 46000000 f5690000 f5690010 c10ca3a8 00000020
[  117.898284]  00000018 f1be5a80 f269a36c f5690010 c106d73d f269a300 f194ceec 00000002
[  117.898708]  f1b90380 f2ba0424 f194ce80 c1140568 00000000 f1b90380 f2b7b660 f2ba0424
[  117.899191] Call Trace:
[  117.899191]  [<c10ca3a8>] ? nfsd_setuser_and_check_port+0x53/0x58
[  117.899191]  [<c106d73d>] ? kmemdup+0x16/0x30
[  117.899191]  [<c1140568>] ? selinux_dentry_open+0xd6/0xdc
[  117.899191]  [<c113a820>] ? security_dentry_open+0xc/0xd
[  117.899191]  [<c1082c9e>] ? __dentry_open+0xfb/0x208
[  117.899191]  [<c1082e0c>] ? dentry_open+0x61/0x68
[  117.899191]  [<c10cbf9b>] ? nfsd_open+0x16b/0x1a0
[  117.899191]  [<c10cc309>] ? nfsd_read+0x64/0x9f
[  117.899191]  [<c10c9b5d>] ? nfsd_proc_read+0x109/0x13d
[  117.899191]  [<c134013c>] ? cache_check+0x52/0x414
[  117.899191]  [<c102d385>] ? groups_alloc+0x2a/0x94
[  117.899191]  [<c10d0685>] ? nfssvc_decode_readargs+0x8a/0xde
[  117.899191]  [<c10c7e1e>] ? nfsd_dispatch+0xca/0x196
[  117.899191]  [<c1339a3b>] ? svc_process+0x379/0x656
[  117.899191]  [<c10c8299>] ? nfsd+0xde/0x11a
[  117.899191]  [<c10c81bb>] ? nfsd+0x0/0x11a
[  117.899191]  [<c10317c8>] ? kthread+0x42/0x67
[  117.899191]  [<c1031786>] ? kthread+0x0/0x67
[  117.899191]  [<c100320f>] ? kernel_thread_helper+0x7/0x10
[  117.899191] Code: a0 ef ff ff 5b 5e eb 02 31 c0 5b 5e c3 55 57 56 53 83 ec 3c 89 c7 89 0c 24 8b 5c 24 50 31 c0 f6 82 4d 01 00 00 02 75 3f 8b 47 58 <8b> 68 04 8b b2 54 01 00 00 85 db 75 1a b9 0e 00 00 00 8d 7c 24
[  117.899191] EIP: [<c113d95c>] inode_has_perm+0x1e/0x62 SS:ESP 0068:f5719e2c
[  117.899191] CR2: 0000000000000004
[  117.904295] ---[ end trace df59a076396b4ee6 ]---
[  251.771477] BUG: unable to handle kernel NULL pointer dereference at 00000004
[  251.771640] IP: [<c113d95c>] inode_has_perm+0x1e/0x62
[  251.771771] *pde = 00000000
[  251.771892] Oops: 0000 [#5] SMP
[  251.772041] last sysfs file: /sys/kernel/uevent_seqnum
[  251.772137] Modules linked in: uhci_hcd usb_storage sg sr_mod ehci_hcd cdrom forcedeth ohci_hcd usbcore raid10 raid0 pata_amd ata_generic aic7xxx scsi_transport_spi sd_mod
[  251.772876]
[  251.772974] Pid: 3798, comm: nfsd Tainted: G      D    (2.6.30 #14) System Product Name
[  251.772974] EIP: 0060:[<c113d95c>] EFLAGS: 00010246 CPU: 0
[  251.772974] EIP is at inode_has_perm+0x1e/0x62
[  251.772974] EAX: 00000000 EBX: 00000000 ECX: 00000002 EDX: f45bf324
[  251.772974] ESI: f53f2700 EDI: f3d5e400 EBP: f3d5e400 ESP: f569be2c
[  251.772974]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[  251.772974] Process nfsd (pid: 3798, ti=f569a000 task=f71ab4d0 task.ti=f569a000)
[  251.772974] Stack:
[  251.772974]  00000002 f27d2e40 f5bd5000 46000000 f5587a00 f5587a10 c10ca3a8 00000020
[  251.772974]  00000018 f1bdddc0 f53f2eec f5587a10 c106d73d f53f2e80 f3d5e46c 00000002
[  251.772974]  f53f2700 f45bf324 f3d5e400 c1140568 00000000 f53f2700 f45d8110 f45bf324
[  251.772974] Call Trace:
[  251.772974]  [<c10ca3a8>] ? nfsd_setuser_and_check_port+0x53/0x58
[  251.772974]  [<c106d73d>] ? kmemdup+0x16/0x30
[  251.772974]  [<c1140568>] ? selinux_dentry_open+0xd6/0xdc
[  251.772974]  [<c113a820>] ? security_dentry_open+0xc/0xd
[  251.772974]  [<c1082c9e>] ? __dentry_open+0xfb/0x208
[  251.772974]  [<c1082e0c>] ? dentry_open+0x61/0x68
[  251.772974]  [<c10cbf9b>] ? nfsd_open+0x16b/0x1a0
[  251.772974]  [<c10cc309>] ? nfsd_read+0x64/0x9f
[  251.772974]  [<c10c9b5d>] ? nfsd_proc_read+0x109/0x13d
[  251.772974]  [<c134013c>] ? cache_check+0x52/0x414
[  251.772974]  [<c102d385>] ? groups_alloc+0x2a/0x94
[  251.772974]  [<c10d0685>] ? nfssvc_decode_readargs+0x8a/0xde
[  251.772974]  [<c10c7e1e>] ? nfsd_dispatch+0xca/0x196
[  251.772974]  [<c1339a3b>] ? svc_process+0x379/0x656
[  251.772974]  [<c10c8299>] ? nfsd+0xde/0x11a
[  251.772974]  [<c10c81bb>] ? nfsd+0x0/0x11a
[  251.772974]  [<c10317c8>] ? kthread+0x42/0x67
[  251.772974]  [<c1031786>] ? kthread+0x0/0x67
[  251.772974]  [<c100320f>] ? kernel_thread_helper+0x7/0x10
[  251.772974] Code: a0 ef ff ff 5b 5e eb 02 31 c0 5b 5e c3 55 57 56 53 83 ec 3c 89 c7 89 0c 24 8b 5c 24 50 31 c0 f6 82 4d 01 00 00 02 75 3f 8b 47 58 <8b> 68 04 8b b2 54 01 00 00 85 db 75 1a b9 0e 00 00 00 8d 7c 24
[  251.772974] EIP: [<c113d95c>] inode_has_perm+0x1e/0x62 SS:ESP 0068:f569be2c
[  251.772974] CR2: 0000000000000004
[  251.780353] ---[ end trace df59a076396b4ee7 ]---




_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
Comment 22 Eric Sandeen 2009-07-15 04:16:31 UTC
Ok, on 2.6.30-6.fc12.x86_64 w/ selinux enabled I can oops it very quickly with the suggested large dd from a client (client is 2.6.27.24-170.2.68.fc10.x86_64)

rpc.nfsd used greatest stack depth: 2752 bytes left
(note: post oops, I have: nfsd used greatest stack depth: 1528 bytes left)

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/module/lockd/initstate
CPU 1 
Modules linked in: nfsd lockd nfs_acl auth_rpcgss xfs exportfs ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_table dm_multipath kvm i2c_i801 i3000_edac pata_sil680 iTCO_wdt iTCO_vendor_support i2c_core edac_core shpchp tg3 joydev serio_raw pcspkr dcdbas ata_generic pata_acpi [last unloaded: microcode]
Pid: 1530, comm: nfsd Not tainted 2.6.30-6.fc12.x86_64 #1 PowerEdge 860
RIP: 0010:[<ffffffff811ff1d6>]  [<ffffffff811ff1d6>] inode_has_perm+0x3d/0x87
RSP: 0018:ffff88007046fb60  EFLAGS: 00010246
RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000100004 RCX: 0000000000000000
RDX: 0000000000100004 RSI: ffff8800715a92f8 RDI: ffff88007d101c00
RBP: ffff88007046fbb0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8800785b8210 R11: 0000000000000000 R12: ffff8800715a92f8
R13: ffff88007d101c00 R14: ffff880070176f00 R15: ffff8800715a92f8
FS:  0000000000000000(0000) GS:ffff880004fdb000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fb7b55b6000 CR3: 0000000071b32000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process nfsd (pid: 1530, threadinfo ffff88007046e000, task ffff880070470000)
Stack:
 ffff88007046fbb0 0000000000000246 ffffffff81777f58 000000008c63e246
 ffffffff81777f40 000000008c63e246 ffff880070176f00 ffffffff81777f40
 ffff88007046fbd0 000000008c63e246 ffff88007046fbf0 ffffffff81202b60
Call Trace:
 [<ffffffff81202b60>] selinux_dentry_open+0x101/0x122
 [<ffffffff811f50e1>] security_dentry_open+0x29/0x3f
 [<ffffffff811237b7>] __dentry_open+0x16f/0x316
 [<ffffffff811239f2>] dentry_open+0x94/0xaf
 [<ffffffffa02df143>] nfsd_open+0x13c/0x179 [nfsd]
 [<ffffffff814b96ec>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffffa02df4a6>] nfsd_write+0xa2/0x100 [nfsd]
 [<ffffffff81013b50>] ? restore_args+0x0/0x30
 [<ffffffffa02e6ed2>] nfsd3_proc_write+0xe4/0x114 [nfsd]
 [<ffffffffa02d93dc>] nfsd_dispatch+0x112/0x204 [nfsd]
 [<ffffffffa01bbc33>] svc_process+0x438/0x668 [sunrpc]
 [<ffffffffa02d9a16>] nfsd+0x119/0x164 [nfsd]
 [<ffffffffa02d98fd>] ? nfsd+0x0/0x164 [nfsd]
 [<ffffffff81075380>] kthread+0x6d/0xae
 [<ffffffff8101418a>] child_rip+0xa/0x20
 [<ffffffff81013b50>] ? restore_args+0x0/0x30
 [<ffffffff81075313>] ? kthread+0x0/0xae
 [<ffffffff81014180>] ? child_rip+0x0/0x20
Code: 28 00 00 00 48 89 45 f8 31 c0 44 8b 9e 78 04 00 00 49 89 c8 41 81 e3 00 02 00 00 75 42 48 8b 47 68 48 85 c9 4c 8b 96 80 04 00 00 <44> 8b 48 04 75 19 4c 8d 45 b0 b9 12 00 00 00 44 89 d8 4c 89 c7 
RIP  [<ffffffff811ff1d6>] inode_has_perm+0x3d/0x87
 RSP <ffff88007046fb60>
---[ end trace 9ea7c13d71e4426e ]---
Comment 23 Eric Sandeen 2009-07-15 05:21:58 UTC
(In reply to comment #22)
> Ok, on 2.6.30-6.fc12.x86_64 w/ selinux enabled I can oops it very quickly
> with
> the suggested large dd from a client (client is
> 2.6.27.24-170.2.68.fc10.x86_64)
> 

> RIP: 0010:[<ffffffff811ff1d6>]  [<ffffffff811ff1d6>] inode_has_perm+0x3d/0x87

Oh, but that's a different oops than this bug was originally opened for, sorry.

Maybe 2 regressions here.  :(
Comment 24 Eric Sandeen 2009-07-17 16:35:29 UTC
The bug in comment #22 has been fixed upstream, sorry for the noise.  Now back to the radix tree stuff.
Comment 25 Eric Sandeen 2009-07-18 18:28:37 UTC
(In reply to comment #24)
> The bug in comment #22 has been fixed upstream, sorry for the noise.  Now
> back
> to the radix tree stuff.

For completeness:
https://bugzilla.redhat.com/show_bug.cgi?id=494067 is the bug for the other oops I've confused this bug with, fixed by commit 033a666ccb842ab4134fcd0c861d5ba9f5d6bf3a upstream I believe.
Comment 26 Eric Sandeen 2009-07-20 15:53:23 UTC
Can folks who are hitting the nfs+xfs+radix bug attach their configs?

So far hch & I haven't been able to hit this one ourselves...
Comment 27 Nickolay 2009-07-20 19:12:04 UTC
Created attachment 22416 [details]
My 2.6.30 config

OK. It's my 2.6.30 config.
Problem repeated every day.
Comment 28 Michael Guntsche 2009-07-20 19:32:32 UTC
I am running 2.6.30 with xfs shared via NFS too. I do not get a crash/BUG but after some time a "reconnect_path: npd != pd" message. After that the NFS clients can no longer access the exports and I have to restart the server to make them work again.

For testing purposes I applied the patch from Christoph

http://marc.info/?l=linux-xfs&m=124440888111505

With the locking patch applied I am no longer able to reproduce the problem here (tested it for several hours now). I will let it run like this to see if the problem comes up again.
Comment 29 Christoph Hellwig 2009-07-20 20:13:55 UTC
(In reply to comment #28) 
> For testing purposes I applied the patch from Christoph
> 
> http://marc.info/?l=linux-xfs&m=124440888111505
> 
> With the locking patch applied I am no longer able to reproduce the problem
> here (tested it for several hours now). I will let it run like this to see if
> the problem comes up again.

I had one report where the problem did not go away with it, but maybe that the trees were mixed up there?

There was also another problem where one bug was turned into another by this patch which I still need to investigate, it a a lower priority than this one as only one person was able to reproduce it.

Can everyone who sees this patch try the patch above and report the results?
Comment 30 Nickolay 2009-07-21 09:55:56 UTC
I have experiencing the same problem with 2.6.30.

Yesterday, I have upgrade to 2.6.30 with patch from Christoph.
uptime - 8 hours, and there are no messages like "reconnect_path: npd != pd" in the log, all working fine now.

nickolay@colo_128:~$ iostat -x -k 60 /dev/sdb1
Linux 2.6.30.2 (colo_128)       07/21/2009

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb1              0.00     8.60 29.95 36.51   158.77   330.58    14.73     1.63   24.46   5.14  34.18
Comment 31 Michael Guntsche 2009-07-22 06:03:33 UTC
Just a quick update.

This morning I saw another

reconnect_path: npd != pd

appearing in the logs. I think it happened when one of the clients made a directory listing of an export. I only got one entry though and the client was still able to access the export afterwards, no stale NFS handles etc. So while the patch seems to help it does not seem to fix the problem completely. The uptime of the server was around 7 hours with no real NFS access during that time.
Comment 32 Nickolay 2009-07-23 07:28:32 UTC
root@colo_128:~# date;uptime
Thu Jul 23 11:23:03 MSD 2009
 11:23:03 up 2 days,  6:12,  1 user,  load average: 1.37, 1.52, 1.36

message appeared in the log 16 times:
Jul 22 17:00:56 colo_128 kernel: [129050.188280] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.188280] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.245508] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.245508] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.245527] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.245527] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.246057] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.246057] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.246263] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.246263] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.257988] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.257988] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.258497] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.258497] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.258787] reconnect_path: npd != pd
Jul 22 17:00:56 colo_128 kernel: [129050.258787] reconnect_path: npd != pd

NFS clients are OK, no messages like "nfs: server x.x.x.x not responding...".
Comment 33 Michael Guntsche 2009-07-23 07:33:59 UTC
Some more information.

After getting the npd != pd the client can not access the export for a period of time. Most of the time access is again possible after a few  minutes. Worst case ist still a restart of the Server. For now I reverted back to a 2.6.28 kernel on this machine.
Comment 34 Nickolay 2009-07-23 14:58:25 UTC
(In reply to comment #33)
> Some more information.
> 
> After getting the npd != pd the client can not access the export for a period
> of time. Most of the time access is again possible after a few  minutes.
> Worst
> case ist still a restart of the Server. For now I reverted back to a 2.6.28
> kernel on this machine.

With 2.6.28 I have very poor XFS perfomance(as if with raid controller cache disabled) :(
I/O wait time have always been 100%.
Comment 35 Andrew Randrianasulu 2009-07-25 07:18:14 UTC
bug still here in 2.6.31-rc4....

----


Jul 25 07:34:29 (none) kernel: reconnect_path: npd != pd
Jul 25 08:16:22 (none) kernel: ------------[ cut here ]------------
Jul 25 08:16:22 (none) kernel: kernel BUG at lib/radix-tree.c:464!
Jul 25 08:16:22 (none) kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Jul 25 08:16:22 (none) kernel: last sysfs file: /sys/devices/platform/w83627hf.656/in8_input
Jul 25 08:16:22 (none) kernel: Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_m
ixer_oss sg sd_mod ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc usb_storage scsi_mod usb_libusual tuner_simple tuner_types
 tda9887 tda8290 snd_cs4236 ppdev snd_mpu401 snd_emu10k1 snd_via82xx saa7134 snd_ac97_codec snd_wavefront snd_wss_lib ac
97_bus uhci_hcd ir_common ehci_hcd snd_pcm videobuf_dma_sg snd_opl3_lib videobuf_core snd_timer tveeprom snd_mpu401_uart
 snd_page_alloc snd_util_mem usbcore shpchp emu10k1_gp pci_hotplug snd_rawmidi 8139too snd_seq_device snd_hwdep snd via_
agp rtc_cmos rtc_core rtc_lib floppy ns558 parport_pc gameport soundcore parport xfs exportfs ufs agpgart tuner v4l2_com
mon videodev v4l1_compat w83627hf hwmon_vid hwmon i2c_viapro i2c_dev fbcon tileblit font bitblit softcursor fb
Jul 25 08:16:22 (none) kernel:
Jul 25 08:16:22 (none) kernel: Pid: 227, comm: kswapd0 Not tainted (2.6.31-rc4-i486 #30) MS-6380E
Jul 25 08:16:22 (none) kernel: EIP: 0060:[<c1148d93>] EFLAGS: 00010246 CPU: 0
Jul 25 08:16:22 (none) kernel: EIP is at radix_tree_tag_set+0x6e/0xa8
Jul 25 08:16:22 (none) kernel: EAX: c5c39360 EBX: 00000024 ECX: 00000024 EDX: 00000000
Jul 25 08:16:22 (none) kernel: ESI: 00000000 EDI: 00000001 EBP: ce962e40 ESP: ce962e20
Jul 25 08:16:22 (none) kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jul 25 08:16:22 (none) kernel: Process kswapd0 (pid: 227, ti=ce962000 task=ce9968b0 task.ti=ce962000)
Jul 25 08:16:22 (none) kernel: Stack:
Jul 25 08:16:22 (none) kernel:  00000000 01e1dd64 cea7d498 00000000 00000000 c6a44000 c9aa8d80 c9aa8ecc
Jul 25 08:16:22 (none) kernel: <0> ce962e5c d0b44b62 cea7d474 cea7d420 c9aa8f6c c9aa8d80 0000007f ce962e6c
Jul 25 08:16:22 (none) kernel: <0> d0b38013 c9aa8f6c c9aa8f74 ce962e78 d0b435ce c9aa8f6c ce962e84 c10ccfe0
Jul 25 08:16:22 (none) kernel: Call Trace:
Jul 25 08:16:22 (none) kernel:  [<d0b44b62>] ? xfs_inode_set_reclaim_tag+0x7a/0x98 [xfs]
Jul 25 08:16:22 (none) kernel:  [<d0b38013>] ? xfs_reclaim+0x8e/0x94 [xfs]
Jul 25 08:16:22 (none) kernel:  [<d0b435ce>] ? xfs_fs_destroy_inode+0x2f/0x48 [xfs]
Jul 25 08:16:22 (none) kernel:  [<c10ccfe0>] ? destroy_inode+0x79/0x8a
Jul 25 08:16:22 (none) kernel:  [<c10cd098>] ? dispose_list+0xa7/0xd4
Jul 25 08:16:22 (none) kernel:  [<c10cd270>] ? shrink_icache_memory+0x1ab/0x1d8
Jul 25 08:16:22 (none) kernel:  [<c109a960>] ? shrink_slab+0xda/0x134
Jul 25 08:16:22 (none) kernel:  [<c109add0>] ? kswapd+0x416/0x58b
Jul 25 08:16:22 (none) kernel:  [<c1099363>] ? isolate_pages_global+0x0/0x1d0
Jul 25 08:16:22 (none) kernel:  [<c1046c20>] ? autoremove_wake_function+0x0/0x38
Jul 25 08:16:22 (none) kernel:  [<c1025911>] ? complete+0x39/0x43
Jul 25 08:16:22 (none) kernel:  [<c109a9ba>] ? kswapd+0x0/0x58b
Jul 25 08:16:22 (none) kernel:  [<c10469b6>] ? kthread+0x6e/0x73
Jul 25 08:16:22 (none) kernel:  [<c1046948>] ? kthread+0x0/0x73
Jul 25 08:16:22 (none) kernel:  [<c1003807>] ? kernel_thread_helper+0x7/0x10
Jul 25 08:16:22 (none) kernel: Code: 8b 5d e4 8b 55 e0 d3 eb 83 e3 3f 89 d9 e8 5f fc ff ff 85 c0 75 0d 8b 55 f0 8d 84 16
 10 01 00 00 0f ab 18 8b 74 9e 10 85 f6 75 04 <0f> 0b eb fe 83 6d ec 06 4f 85 ff 75 c3 85 f6 74 1f 8b 4d e8 b8
Jul 25 08:16:22 (none) kernel: EIP: [<c1148d93>] radix_tree_tag_set+0x6e/0xa8 SS:ESP 0068:ce962e20
Jul 25 08:16:22 (none) kernel: ---[ end trace de58457be1c74623 ]---


in addition i have this trace a bit earlier:

---

l 25 07:22:46 (none) kernel: =================================
Jul 25 07:22:46 (none) kernel: [ INFO: inconsistent lock state ]
Jul 25 07:22:46 (none) kernel: 2.6.31-rc4-i486 #30
Jul 25 07:22:46 (none) kernel: ---------------------------------
Jul 25 07:22:46 (none) kernel: inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
Jul 25 07:22:46 (none) kernel: kswapd0/227 [HC0[0]:SC0[0]:HE1:SE1] takes:
Jul 25 07:22:46 (none) kernel:  (&(&ip->i_lock)->mr_lock){++++-+}, at: [<d0b214b3>] xfs_ilock+0x54/0x6c [xfs]
Jul 25 07:22:46 (none) kernel: {RECLAIM_FS-ON-R} state was registered at:
Jul 25 07:22:46 (none) kernel:   [<c1055c64>] mark_held_locks+0x47/0x5f
Jul 25 07:22:46 (none) kernel:   [<c1055d14>] lockdep_trace_alloc+0x98/0xae
Jul 25 07:22:46 (none) kernel:   [<c10b58ef>] kmem_cache_alloc+0x29/0x128
Jul 25 07:22:46 (none) kernel:   [<d0b3b5eb>] kmem_zone_alloc+0x53/0x93 [xfs]
Jul 25 07:22:46 (none) kernel:   [<d0b13485>] xfs_da_buf_make+0x26/0x123 [xfs]
Jul 25 07:22:46 (none) kernel:   [<d0b13a39>] xfs_da_do_buf+0x4b7/0x5de [xfs]
Jul 25 07:22:46 (none) kernel:   [<d0b13bc5>] xfs_da_read_buf+0x1d/0x22 [xfs]
Jul 25 07:22:46 (none) kernel:   [<d0b16eec>] xfs_dir2_block_lookup_int+0x3e/0x17b [xfs]
Jul 25 07:22:46 (none) kernel:   [<d0b17467>] xfs_dir2_block_lookup+0x1b/0x98 [xfs]
Jul 25 07:22:46 (none) kernel:   [<d0b1624a>] xfs_dir_lookup+0xa8/0x10f [xfs]
Jul 25 07:22:46 (none) kernel:   [<d0b394e0>] xfs_lookup+0x42/0x99 [xfs]
Jul 25 07:22:46 (none) kernel:   [<d0b4151c>] xfs_vn_lookup+0x3d/0x75 [xfs]
Jul 25 07:22:46 (none) kernel:   [<c10c403c>] do_lookup+0xb9/0x129
Jul 25 07:22:46 (none) kernel:   [<c10c4a70>] __link_path_walk+0x5a7/0x6d0
Jul 25 07:22:46 (none) kernel:   [<c10c4d64>] path_walk+0x55/0xaa
Jul 25 07:22:46 (none) kernel:   [<c10c50f8>] do_path_lookup+0x26/0x74
Jul 25 07:22:46 (none) kernel:   [<c10c6f3f>] user_path_at+0x41/0x67
Jul 25 07:22:46 (none) kernel:   [<c10c024a>] vfs_fstatat+0x32/0x59
Jul 25 07:22:46 (none) kernel:   [<c10c02bf>] vfs_lstat+0x1b/0x1d
Jul 25 07:22:46 (none) kernel:   [<c10c02da>] sys_lstat64+0x19/0x2d
Jul 25 07:22:46 (none) kernel:   [<c1002bc5>] syscall_call+0x7/0xb
Jul 25 07:22:46 (none) kernel:   [<ffffffff>] 0xffffffff
Jul 25 07:22:46 (none) kernel: irq event stamp: 484081
Jul 25 07:22:46 (none) kernel: hardirqs last  enabled at (484081): [<c10b6dba>] kmem_cache_free+0xad/0xdd
Jul 25 07:22:46 (none) kernel: hardirqs last disabled at (484080): [<c10b6d49>] kmem_cache_free+0x3c/0xdd
Jul 25 07:22:46 (none) kernel: softirqs last  enabled at (483284): [<c10392b8>] __do_softirq+0x15d/0x165
Jul 25 07:22:46 (none) kernel: softirqs last disabled at (483271): [<c1005240>] do_softirq+0x63/0xb3
Jul 25 07:22:46 (none) kernel:
Jul 25 07:22:46 (none) kernel: other info that might help us debug this:
Jul 25 07:22:46 (none) kernel: 2 locks held by kswapd0/227:
Jul 25 07:22:46 (none) kernel:  #0:  (shrinker_rwsem){++++..}, at: [<c109a8ae>] shrink_slab+0x28/0x134
Jul 25 07:22:46 (none) kernel:  #1:  (iprune_mutex){+.+.-.}, at: [<c10cd102>] shrink_icache_memory+0x3d/0x1d8
Jul 25 07:22:46 (none) kernel:
Jul 25 07:22:46 (none) kernel: stack backtrace:
Jul 25 07:22:46 (none) kernel: Pid: 227, comm: kswapd0 Not tainted 2.6.31-rc4-i486 #30
Jul 25 07:22:46 (none) kernel: Call Trace:
Jul 25 07:22:46 (none) kernel:  [<c127f111>] ? printk+0x14/0x1b
Jul 25 07:22:46 (none) kernel:  [<c1055a20>] valid_state+0x12a/0x13d
Jul 25 07:22:46 (none) kernel:  [<c1055b5b>] mark_lock+0x128/0x1ea
Jul 25 07:22:46 (none) kernel:  [<c10561bd>] ? check_usage_forwards+0x0/0x3f
Jul 25 07:22:46 (none) kernel:  [<c10569f2>] __lock_acquire+0x2fc/0xaf2
Jul 25 07:22:46 (none) kernel:  [<c10571d9>] ? __lock_acquire+0xae3/0xaf2
Jul 25 07:22:46 (none) kernel:  [<c10e143c>] ? fsnotify_clear_marks_by_inode+0x23/0xaa
Jul 25 07:22:46 (none) kernel:  [<c1057278>] lock_acquire+0x90/0xad
Jul 25 07:22:46 (none) kernel:  [<d0b214b3>] ? xfs_ilock+0x54/0x6c [xfs]
Jul 25 07:22:46 (none) kernel:  [<c104a1e2>] down_write_nested+0x3a/0x76
Jul 25 07:22:46 (none) kernel:  [<d0b214b3>] ? xfs_ilock+0x54/0x6c [xfs]
Jul 25 07:22:46 (none) kernel:  [<d0b214b3>] xfs_ilock+0x54/0x6c [xfs]
Jul 25 07:22:46 (none) kernel:  [<d0b37fd2>] xfs_reclaim+0x4d/0x94 [xfs]
Jul 25 07:22:46 (none) kernel:  [<d0b435ce>] xfs_fs_destroy_inode+0x2f/0x48 [xfs]
Jul 25 07:22:46 (none) kernel:  [<c10ccfe0>] destroy_inode+0x79/0x8a
Jul 25 07:22:46 (none) kernel:  [<c10cd098>] dispose_list+0xa7/0xd4
Jul 25 07:22:46 (none) kernel:  [<c10cd270>] shrink_icache_memory+0x1ab/0x1d8
Jul 25 07:22:46 (none) kernel:  [<c109a960>] shrink_slab+0xda/0x134
Jul 25 07:22:46 (none) kernel:  [<c109add0>] kswapd+0x416/0x58b
Jul 25 07:22:46 (none) kernel:  [<c1099363>] ? isolate_pages_global+0x0/0x1d0
Jul 25 07:22:46 (none) kernel:  [<c1046c20>] ? autoremove_wake_function+0x0/0x38
Jul 25 07:22:46 (none) kernel:  [<c1025911>] ? complete+0x39/0x43
Jul 25 07:22:46 (none) kernel:  [<c109a9ba>] ? kswapd+0x0/0x58b
Jul 25 07:22:46 (none) kernel:  [<c10469b6>] kthread+0x6e/0x73
Jul 25 07:22:46 (none) kernel:  [<c1046948>] ? kthread+0x0/0x73
Jul 25 07:22:46 (none) kernel:  [<c1003807>] kernel_thread_helper+0x7/0x10


should i open new bug for this one?
Comment 36 Nickolay 2009-07-25 14:50:27 UTC
Created attachment 22489 [details]
call trace screen

Kernel Panic with 2.6.30 with patch from Christoph.
Seems that patch didn't resolve the problem.
Comment 37 Christoph Hellwig 2009-08-03 00:32:59 UTC
Created attachment 22581 [details]
Various iget fixes

I've found various issued with the way we deal with error in the iget code.  The patch below is a rollup of various patches in that area and some additional debug checks.  This should fix the xfs_fs_destroy_inode issues people see, and there is some way it could also fix the npd != pd messages, although I'm not entirely confident about that.

Note that the rollup includes the previous iget locking fixes some people on here already tested.
Comment 38 Andrew Randrianasulu 2009-08-03 09:14:36 UTC
Probably i saved it in the wrong way, but ....

git apply -v various_iget_fixes.diff 
Checking patch fs/xfs/xfs_iget.c...
error: while searching for:
        xfs_qm_dqdetach(ip);
        xfs_iunlock(ip, XFS_ILOCK_EXCL);

        switch (ip->i_d.di_mode & S_IFMT) {
        case S_IFREG:
        case S_IFDIR:
        case S_IFLNK:
                xfs_idestroy_fork(ip, XFS_DATA_FORK);
                break;
        }

        if (ip->i_afp)
                xfs_idestroy_fork(ip, XF
error: patch failed: fs/xfs/xfs_iget.c:502
error: fs/xfs/xfs_iget.c: patch does not apply
Checking patch fs/xfs/linux-2.6/xfs_sync.c...
Checking patch fs/xfs/linux-2.6/xfs_sync.h...
Checking patch fs/inode.c...
Checking patch include/linux/fs.h...
Checking patch fs/xfs/xfs_inode.h...
Checking patch fs/exportfs/Makefile...
Checking patch fs/exportfs/expfs.c...

i have Linus' tree at commit  a33a052f19a21d727847391c8c1aff3fb221c472 ( Merge branch 'for-linus' of git://neil.brown.name/md)
Comment 39 Christoph Hellwig 2009-08-03 11:50:40 UTC
(In reply to comment #38)
> Probably i saved it in the wrong way, but ....
> 
> git apply -v various_iget_fixes.diff 
> Checking patch fs/xfs/xfs_iget.c...
> error: while searching for:
>         xfs_qm_dqdetach(ip);
>         xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 
>         switch (ip->i_d.di_mode & S_IFMT) {
>         case S_IFREG:
>         case S_IFDIR:
>         case S_IFLNK:
>                 xfs_idestroy_fork(ip, XFS_DATA_FORK);
>                 break;
>         }
> 
>         if (ip->i_afp)
>                 xfs_idestroy_fork(ip, XF

I had another patch touching this area so the patch only applied with fuzz.  It looks like git doesn't like that.  Either apply the patch manually using patch or quilt or use the refreshed one I'll attach now.
Comment 40 Christoph Hellwig 2009-08-03 11:53:36 UTC
Created attachment 22584 [details]
Iget fixes, refreshed

A version of the iget fixes rollup that applies against current mainline without fuzz.
Comment 41 Christoph Hellwig 2009-08-03 18:11:51 UTC
Created attachment 22590 [details]
Iget fixes rollup for Linux 2.6.30

This is a version of the above patch that applies against 2.6.30.
Comment 42 Andrew Randrianasulu 2009-08-03 21:03:29 UTC
Root device is (22, 1)
Setup is 12300 bytes (padded to 12800 bytes).
System is 2252 kB
CRC ba14472
Kernel: arch/x86/boot/bzImage is ready  (#32)
  Building modules, stage 2.
  MODPOST 1582 modules
ERROR: "__destroy_inode" [fs/xfs/xfs.ko] undefined!

----

should I run make clean before ?  (already tried to delete all *.cmd, *.o, *.ko from fs/, fs/xfs, and fs/exportfs - same error)
Comment 43 Christoph Hellwig 2009-08-03 21:06:48 UTC
Created attachment 22596 [details]
export the right symbol

Sorry, the for 2.6.31-rc version of the patch had a problem if XFS is used as a module as it exported the wrong symbol.  Apply this trivial patch to fix it up.  The for-2.6.30 version already has this fixed.
Comment 44 Dan Carley 2009-08-07 14:01:21 UTC
Christoph's 2.6.30 patch in attachment 22590 [details] has fixed the issues of "npd != pd" messages and stale NFS file handles for us. Has been stable for 4 days now. Thanks.
Comment 45 Christoph Hellwig 2009-08-10 02:15:46 UTC
The fixes are now in Linus tree.  If you're using 2.6.30 use the rollup patch in this bugzilla.  I will probably send it to -stable once it has settled in Linus tree.
Comment 46 Rafael J. Wysocki 2009-08-10 13:54:58 UTC
Closing, since the fixes are in the mainline now.
Comment 47 Rafiu Fakunle 2009-08-14 09:35:57 UTC
Are there any plans for this fix to be backported to 2.6.29.x?
Comment 48 Christoph Hellwig 2009-08-14 16:49:11 UTC
Yes, I will submit these to 2.6.29-stable and 2.6.30-stable next week after doing some QA on the backports.
Comment 49 Rafiu Fakunle 2009-08-14 18:46:29 UTC
(In reply to comment #48)
> Yes, I will submit these to 2.6.29-stable and 2.6.30-stable next week after
> doing some QA on the backports.

Great, thanks!
Comment 50 Christoph Hellwig 2009-08-16 02:42:36 UTC
Created attachment 22741 [details]
2.6.30-stable backport patch series

These are the backported patches I want to submit for 2.6.30-stable.

Please test them if you can, especially with NFS workloads.

Note that the tarball also includes a 4th unrelated patch that I want to submit for -stable.
Comment 51 Christoph Hellwig 2009-08-16 04:05:06 UTC
Created attachment 22742 [details]
2.6.29-stable backport patch series

These are the backported patches I want to submit for 2.6.29-stable.

Please test them if you can, especially with NFS workloads.

Note that the tarball also includes a 4th unrelated patch that I want to submit
for -stable.
Comment 52 Nickolay 2009-09-08 16:36:43 UTC
I update the kernel to 2.6.31-rc9, but it seems that problem is not solved.
Server rebooted in several hours.
/proc/sys/kernel/panic_on_oops and /proc/sys/kernel/panic contain 1, so I couldn't capture the screen. :(
Comment 53 Ferenc Wágner 2012-01-25 21:54:06 UTC
Created attachment 72194 [details]
screenshot

Looks like this bug isn't yet fixed in 3.2, see attached screenshot. I got it when hibernating my laptop. I can't reproduce it easily, I've seen it only once up to now. I'm also attaching my config.
Comment 54 Ferenc Wágner 2012-01-25 21:57:24 UTC
Created attachment 72195 [details]
The config going with attachment 72194 [details]

The config going with attachment 72194 [details].

Note You need to log in before you can comment on or make changes to this bug.