Bug 11653 - Oops resulting from NFSD/XFS interaction?
Summary: Oops resulting from NFSD/XFS interaction?
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: All Linux
Importance: P1 normal
Assignee: Dave Chinner
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-09-26 13:59 UTC by Joshua Hoblitt
Modified: 2008-09-27 18:26 UTC
CC List: 0 users

See Also:
Kernel Version: 2.6.27-rc5 (netdev-2.6)
Tree: Mainline
Regression: ---


Attachments
.config (46.14 KB, text/plain)
2008-09-26 14:02 UTC, Joshua Hoblitt

Description Joshua Hoblitt 2008-09-26 13:59:25 UTC
[244160.254446] BUG: unable to handle kernel paging request at ffff8804343d3f58
[244160.256210] IP: [<ffffffffa007b111>] xfs_alloc_fix_freelist+0x27/0x415 [xfs]
[244160.256210] PGD 202063 PUD 18067 PMD 0 
[244160.256210] Oops: 0000 [1] SMP 
[244160.256210] CPU 3 
[244160.256210] Modules linked in: k8temp autofs4 i2c_i801 i2c_core iTCO_wdt e1000e tg3 libphy e1000 xfs dm_snapshot dm_mirror dm_log aacraid 3w_9xxx 3w_xxxx atp870u arcmsr aic7xxx scsi_wait_scan
[244160.256210] Pid: 8030, comm: nfsd Not tainted 2.6.27-rc5-22033-gd26acd9-dirty #2
[244160.256210] RIP: 0010:[<ffffffffa007b111>]  [<ffffffffa007b111>] xfs_alloc_fix_freelist+0x27/0x415 [xfs]
[244160.256210] RSP: 0018:ffff8804279c9aa0  EFLAGS: 00010286
[244160.256210] RAX: ffff88042cca9000 RBX: 00010ffffc080943 RCX: ffff88042bdb1648
[244160.256210] RDX: ffff88042cca9000 RSI: 0000000000000002 RDI: ffff8804279c9b80
[244160.256210] RBP: ffff8804279c9b80 R08: 0000000000000001 R09: 0000000000000001
[244160.256210] R10: 0000000000000000 R11: ffffffff805c7894 R12: 0000000000000002
[244160.256210] R13: ffff8802cf5a7b60 R14: 0000000000000000 R15: ffff8804343d3f58
[244160.256210] FS:  0000000000000000(0000) GS:ffff88042e444200(0000) knlGS:0000000000000000
[244160.256210] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[244160.256210] CR2: ffff8804343d3f58 CR3: 00000004139d0000 CR4: 00000000000006e0
[244160.256210] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[244160.256210] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[244160.256210] Process nfsd (pid: 8030, threadinfo ffff8804279c8000, task ffff88042bdb0f10)
[244160.256210] Stack:  0000000000000018 0000000000000046 ffff88042cca9000 0000000000000000
[244160.256210]  ffff88042bdb0f10 0000000000000046 ffff88042cca9330 0000000000000000
[244160.256210]  0000000000000004 0000000000000000 ffff8804279c9cc0 0000000000000046
[244160.256210] Call Trace:
[244160.256210]  [<ffffffff805c7f14>] ? _spin_unlock_irq+0x1f/0x22
[244160.256210]  [<ffffffff805c78b3>] ? __down_read+0x34/0x9e
[244160.256210]  [<ffffffffa007b58a>] ? xfs_free_extent+0x8b/0xcc [xfs]
[244160.256210]  [<ffffffffa00840b9>] ? xfs_bmap_finish+0xee/0x15f [xfs]
[244160.256210]  [<ffffffffa00a3d8a>] ? xfs_itruncate_finish+0x190/0x2ba [xfs]
[244160.256210]  [<ffffffffa00bc0f3>] ? xfs_inactive+0x1e1/0x412 [xfs]
[244160.256210]  [<ffffffffa00c6a0e>] ? xfs_fs_clear_inode+0xb5/0xf7 [xfs]
[244160.256210]  [<ffffffff802b78a1>] ? clear_inode+0x75/0xcc
[244160.256210]  [<ffffffff802b7a10>] ? generic_delete_inode+0xd1/0x134
[244160.256210]  [<ffffffff802b6ce1>] ? d_delete+0x4a/0xc1
[244160.256210]  [<ffffffff802ad291>] ? vfs_unlink+0xea/0x109
[244160.256210]  [<ffffffff80372d13>] ? nfsd_unlink+0x1ee/0x26c
[244160.256210]  [<ffffffff8037aab0>] ? nfsd3_proc_remove+0x9d/0xaa
[244160.256210]  [<ffffffff8036f158>] ? nfsd_dispatch+0xde/0x1c2
[244160.256210]  [<ffffffff805a22ee>] ? svc_process+0x408/0x6ea
[244160.256210]  [<ffffffff805c78b3>] ? __down_read+0x34/0x9e
[244160.256210]  [<ffffffff8036f7a1>] ? nfsd+0x1bf/0x296
[244160.256210]  [<ffffffff8036f5e2>] ? nfsd+0x0/0x296
[244160.256210]  [<ffffffff80247805>] ? kthread+0x47/0x76
[244160.256210]  [<ffffffff80230223>] ? schedule_tail+0x27/0x5f
[244160.256210]  [<ffffffff8020ce09>] ? child_rip+0xa/0x11
[244160.256210]  [<ffffffff8024767c>] ? kthreadd+0x167/0x18c
[244160.256210]  [<ffffffff802477be>] ? kthread+0x0/0x76
[244160.256210]  [<ffffffff8020cdff>] ? child_rip+0x0/0x11
[244160.256210] 
[244160.256210] 
[244160.256210] Code: 5d 41 5c c3 41 57 41 56 41 55 41 54 41 89 f4 55 48 89 fd 53 48 81 ec a8 00 00 00 48 8b 47 08 48 89 44 24 10 4c 8b 7f 18 4c 8b 2f <41> 80 3f 00 75 2d 8b 57 28 4c 8d 84 24 90 00 00 00 89 f1 48 89 
[244160.256210] RIP  [<ffffffffa007b111>] xfs_alloc_fix_freelist+0x27/0x415 [xfs]
[244160.256210]  RSP <ffff8804279c9aa0>
[244160.256210] CR2: ffff8804343d3f58
[244160.256210] ---[ end trace ce07a23d948faa80 ]---
Comment 1 Joshua Hoblitt 2008-09-26 14:02:15 UTC
Created attachment 18062 [details]
.config
Comment 2 Dave Chinner 2008-09-26 18:24:37 UTC
IIRC this one is caused by a corrupted block pointer not being bounds-checked
correctly. Hence we end up with an index into an array that is wildly off.

Can you run xfs_check on the filesystem to see if there is a corrupted
btree block somewhere in the filesystem? 
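For reference, checking the filesystem might look like the sketch below. The device and mount point names are hypothetical examples, and xfs_check must be run against an unmounted (or at least quiesced) filesystem:

```shell
# Hypothetical sketch: /dev/sdb1 is the XFS volume, mounted at /export.
# Substitute your own device and mount point.
umount /export          # NFS exports may need to be dropped first
xfs_check /dev/sdb1     # reports corrupt btree blocks, bad pointers, etc.
```

If xfs_check complains about a dirty log, the journal has to be replayed first by mounting and cleanly unmounting the volume once.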
Comment 3 Joshua Hoblitt 2008-09-26 19:09:04 UTC
The volume would not unmount for fsck even though lsof said there were no open file descriptors on it -- perhaps this was NFS.  In any event, the system wouldn't shut down nicely either and I had to power cycle it.  After coming back up,
xfs_check would not run on the volume. xfs_repair would run with -L and found (and fixed) an inode that had an extent allocated beyond the size of the volume or something.  Sadly the shell got closed before I could cut and paste the error.

Still -- this seems like a poor failure mode.  Is it an XFS error handling problem, a problem with NFS preventing the system from unmounting the FS on error, or some combo of the two?
Comment 4 Dave Chinner 2008-09-27 18:26:09 UTC
Joshua,

I might as well close this bug now; what you've done will have removed
any trace of the problem that caused the filesystem to shut down.

In future:

    - the NFS server holds references to the filesystem; you need to
      unexport it or shut down the NFS server to be able to unmount it.
    - xfs_check or xfs_repair can't run correctly until the journal has been
      replayed. Hence after unmounting, you need to remount it to get
      journal replay to occur, then unmount it again and run xfs_check.
    - xfs_repair -L is _dangerous_. It will cause transactions that can
      be safely replayed to be tossed away, and guarantees that xfs_repair
      finds inconsistencies in the filesystem. This typically hides whatever
      problem caused the shutdown - it is a repair method that should only
      be used when the journal is corrupted.
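Put together, the recovery sequence above might look like the following sketch. The device name, mount point, and export path are hypothetical; adapt them to your setup:

```shell
# Hypothetical recovery sketch for an XFS shutdown under NFS.
# Assumes /dev/sdb1 is mounted at /export and exported via NFS.
exportfs -u '*:/export'   # unexport so nfsd drops its references
umount /export            # should now succeed
mount /dev/sdb1 /export   # mounting replays the journal
umount /export            # unmount again, now with a clean log
xfs_check /dev/sdb1       # safe to check; save the output before repairing
# Only if the journal itself is corrupt and xfs_check/xfs_repair refuse
# to run should you fall back to the destructive:
#   xfs_repair -L /dev/sdb1
```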

If you don't know how to handle the failure properly - ask us. We can
tell you exactly what you need to do to quickly recover the filesystem
and to do so in a manner that is also helpful to us in tracking down
the bug that caused the shutdown.

That being said, if it was a corrupted inode btree block, then we've probably
just fixed the last known occurrence of this and it should be available in 2.6.27-rc8....
