[244160.254446] BUG: unable to handle kernel paging request at ffff8804343d3f58
[244160.256210] IP: [<ffffffffa007b111>] xfs_alloc_fix_freelist+0x27/0x415 [xfs]
[244160.256210] PGD 202063 PUD 18067 PMD 0
[244160.256210] Oops: 0000 [1] SMP
[244160.256210] CPU 3
[244160.256210] Modules linked in: k8temp autofs4 i2c_i801 i2c_core iTCO_wdt e1000e tg3 libphy e1000 xfs dm_snapshot dm_mirror dm_log aacraid 3w_9xxx 3w_xxxx atp870u arcmsr aic7xxx scsi_wait_scan
[244160.256210] Pid: 8030, comm: nfsd Not tainted 2.6.27-rc5-22033-gd26acd9-dirty #2
[244160.256210] RIP: 0010:[<ffffffffa007b111>] [<ffffffffa007b111>] xfs_alloc_fix_freelist+0x27/0x415 [xfs]
[244160.256210] RSP: 0018:ffff8804279c9aa0 EFLAGS: 00010286
[244160.256210] RAX: ffff88042cca9000 RBX: 00010ffffc080943 RCX: ffff88042bdb1648
[244160.256210] RDX: ffff88042cca9000 RSI: 0000000000000002 RDI: ffff8804279c9b80
[244160.256210] RBP: ffff8804279c9b80 R08: 0000000000000001 R09: 0000000000000001
[244160.256210] R10: 0000000000000000 R11: ffffffff805c7894 R12: 0000000000000002
[244160.256210] R13: ffff8802cf5a7b60 R14: 0000000000000000 R15: ffff8804343d3f58
[244160.256210] FS: 0000000000000000(0000) GS:ffff88042e444200(0000) knlGS:0000000000000000
[244160.256210] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[244160.256210] CR2: ffff8804343d3f58 CR3: 00000004139d0000 CR4: 00000000000006e0
[244160.256210] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[244160.256210] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[244160.256210] Process nfsd (pid: 8030, threadinfo ffff8804279c8000, task ffff88042bdb0f10)
[244160.256210] Stack: 0000000000000018 0000000000000046 ffff88042cca9000 0000000000000000
[244160.256210]  ffff88042bdb0f10 0000000000000046 ffff88042cca9330 0000000000000000
[244160.256210]  0000000000000004 0000000000000000 ffff8804279c9cc0 0000000000000046
[244160.256210] Call Trace:
[244160.256210]  [<ffffffff805c7f14>] ? _spin_unlock_irq+0x1f/0x22
[244160.256210]  [<ffffffff805c78b3>] ? __down_read+0x34/0x9e
[244160.256210]  [<ffffffffa007b58a>] ? xfs_free_extent+0x8b/0xcc [xfs]
[244160.256210]  [<ffffffffa00840b9>] ? xfs_bmap_finish+0xee/0x15f [xfs]
[244160.256210]  [<ffffffffa00a3d8a>] ? xfs_itruncate_finish+0x190/0x2ba [xfs]
[244160.256210]  [<ffffffffa00bc0f3>] ? xfs_inactive+0x1e1/0x412 [xfs]
[244160.256210]  [<ffffffffa00c6a0e>] ? xfs_fs_clear_inode+0xb5/0xf7 [xfs]
[244160.256210]  [<ffffffff802b78a1>] ? clear_inode+0x75/0xcc
[244160.256210]  [<ffffffff802b7a10>] ? generic_delete_inode+0xd1/0x134
[244160.256210]  [<ffffffff802b6ce1>] ? d_delete+0x4a/0xc1
[244160.256210]  [<ffffffff802ad291>] ? vfs_unlink+0xea/0x109
[244160.256210]  [<ffffffff80372d13>] ? nfsd_unlink+0x1ee/0x26c
[244160.256210]  [<ffffffff8037aab0>] ? nfsd3_proc_remove+0x9d/0xaa
[244160.256210]  [<ffffffff8036f158>] ? nfsd_dispatch+0xde/0x1c2
[244160.256210]  [<ffffffff805a22ee>] ? svc_process+0x408/0x6ea
[244160.256210]  [<ffffffff805c78b3>] ? __down_read+0x34/0x9e
[244160.256210]  [<ffffffff8036f7a1>] ? nfsd+0x1bf/0x296
[244160.256210]  [<ffffffff8036f5e2>] ? nfsd+0x0/0x296
[244160.256210]  [<ffffffff80247805>] ? kthread+0x47/0x76
[244160.256210]  [<ffffffff80230223>] ? schedule_tail+0x27/0x5f
[244160.256210]  [<ffffffff8020ce09>] ? child_rip+0xa/0x11
[244160.256210]  [<ffffffff8024767c>] ? kthreadd+0x167/0x18c
[244160.256210]  [<ffffffff802477be>] ? kthread+0x0/0x76
[244160.256210]  [<ffffffff8020cdff>] ? child_rip+0x0/0x11
[244160.256210]
[244160.256210]
[244160.256210] Code: 5d 41 5c c3 41 57 41 56 41 55 41 54 41 89 f4 55 48 89 fd 53 48 81 ec a8 00 00 00 48 8b 47 08 48 89 44 24 10 4c 8b 7f 18 4c 8b 2f <41> 80 3f 00 75 2d 8b 57 28 4c 8d 84 24 90 00 00 00 89 f1 48 89
[244160.256210] RIP [<ffffffffa007b111>] xfs_alloc_fix_freelist+0x27/0x415 [xfs]
[244160.256210] RSP <ffff8804279c9aa0>
[244160.256210] CR2: ffff8804343d3f58
[244160.256210] ---[ end trace ce07a23d948faa80 ]---
Created attachment 18062: .config
IIRC this one is caused by a corrupted block pointer not being bounds-checked correctly, so we end up with an index into an array that is wildly off. Can you run xfs_check on the filesystem to see if there is a corrupted btree block somewhere?
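For example (a minimal sketch; /dev/sdb1 and /export/data below are placeholder names for wherever this filesystem actually lives), xfs_check is run against the block device while the filesystem is not mounted:

  # unmount first - xfs_check needs a quiescent, unmounted filesystem
  umount /export/data
  # read-only check of the filesystem structures; nothing is modified
  xfs_check /dev/sdb1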
The volume would not unmount so I could fsck it, even though lsof said there were no open file descriptors on it -- perhaps this was NFS. In any event, the system wouldn't shut down nicely either, and I had to power cycle it. After coming back up, xfs_check would not run on the volume. xfs_repair would run with -L and found (and fixed) an inode that had an extent allocated beyond the size of the volume, or something like that. Sadly the shell got closed before I could cut and paste the error. Still -- this seems like a poor failure mode. Is it an XFS error-handling problem, a problem with NFS preventing the system from unmounting the FS on error, or some combination of the two?
Joshua, I might as well close this bug now; what you've done will have removed any trace of the problem that caused the filesystem to shut down. In future:

- The NFS server holds references to the filesystem; you need to unexport it or shut down the NFS server to be able to unmount it.
- xfs_check or xfs_repair can't run correctly until the journal has been replayed. Hence, after unmounting, you need to remount the filesystem to get journal replay to occur, then unmount it again and run xfs_check.
- xfs_repair -L is _dangerous_. It throws away transactions that could have been replayed safely, and it guarantees that xfs_repair finds inconsistencies in the filesystem. This typically hides whatever problem caused the shutdown; it is a repair method that should only be used when the journal itself is corrupted.

If you don't know how to handle the failure properly - ask us. We can tell you exactly what you need to do to quickly recover the filesystem, and to do so in a manner that is also helpful to us in tracking down the bug that caused the shutdown (a rough sketch of the usual recovery sequence is below).

That being said, if it was a corrupted inode btree block, then we've probably just fixed the last known occurrence of this and the fix should be available in 2.6.27-rc8....
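For reference, a minimal sketch of that recovery sequence, assuming a hypothetical setup where the filesystem lives on /dev/sdb1 and is normally mounted at /export/data (substitute your real device, mount point, and export names):

  # stop NFS from holding references so the filesystem can be unmounted
  exportfs -ua              # or shut down the NFS server
  umount /export/data

  # mount once so the journal is replayed, then unmount again
  mount /dev/sdb1 /export/data
  umount /export/data

  # now the checking tools see a consistent filesystem
  xfs_check /dev/sdb1
  xfs_repair -n /dev/sdb1   # -n = report problems only, change nothing

  # only run a real repair once the damage is understood, and do not use -L
  # unless the journal itself is corrupted:
  # xfs_repair /dev/sdb1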