Hello.. we have some problems with our Fileservers currently running kernel 2.6.33 .. Background-Information: We are using a 3ware Raid-Controller 9690SA-8E which has 1-6 Xtore-Enclosures attached with 16 x 1 TB Seagate disks (RAID6) each. The 13TB xfs-mountpoints are exported via NFSv3. We are running this kind of hardware-configuration since 3 years now without any xfs related issues. We have many servers crashing with the same problem. xfs_check/xfs_repair don't report any errors, the RAID6 is ok and no disk reports any smart-related errors. example xfs_info output for one 13 TB filesystem: meta-data=/dev/sdc isize=256 agcount=13, agsize=268435392 blks = sectsz=512 attr=2 data = bsize=4096 blocks=3417931776, imaxpct=5 = sunit=64 swidth=896 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal bsize=4096 blocks=521728, version=2 = sectsz=512 sunit=64 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 The servers stopped running with NumLock/ScrollLock LEDs blinking with kernel 2.6.32 showing: Kernel panic - not syncing: xfs_fs_destroy_inode: cannot reclaim 0xffff88069b165940 Here is the dump of the "kernel BUG" of our 2.6.33 kernel (error message as found in syslog): Mar 11 05:58:33 oneirogmophobie kernel: Assertion failed: __xfs_iflags_test(ip, XFS_IRECLAIMABLE), file: fs/xfs/linux-2.6/xfs_sync.c, line: 727 Mar 11 05:58:33 oneirogmophobie kernel: ------------[ cut here ]------------ Mar 11 05:58:33 oneirogmophobie kernel: kernel BUG at fs/xfs/support/debug.c:109! Mar 11 05:58:33 oneirogmophobie kernel: invalid opcode: 0000 [#1] PREEMPT SMP Mar 11 05:58:33 oneirogmophobie kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:0f.0/0000:07:00.0/host6/scsi_host/host6/stats Mar 11 05:58:33 oneirogmophobie kernel: CPU 1 Mar 11 05:58:33 oneirogmophobie kernel: Pid: 2923, comm: xfssyncd Not tainted 2.6.33.x86_64.42 #2 S2932/S2932-E/S2932-SI/empty Mar 11 05:58:33 oneirogmophobie kernel: RIP: 0010:[<ffffffff81305f6a>] [<ffffffff81305f6a>] assfail+0x1a/0x20 Mar 11 05:58:33 oneirogmophobie kernel: RSP: 0018:ffff8804248dbd50 EFLAGS: 00010286 Mar 11 05:58:33 oneirogmophobie kernel: RAX: 000000000000006b RBX: ffff88027c994840 RCX: 000000000001ffff Mar 11 05:58:33 oneirogmophobie kernel: RDX: ffff880028240000 RSI: 0000000000000046 RDI: ffffffff81d8a414 Mar 11 05:58:33 oneirogmophobie kernel: RBP: ffff8804248dbd50 R08: 0000000000009bab R09: 0000000000000000 Mar 11 05:58:33 oneirogmophobie kernel: R10: 0000000000000000 R11: ffff8806666d5b08 R12: 0000000000000002 Mar 11 05:58:33 oneirogmophobie kernel: R13: ffff88027c994938 R14: ffff8804267b8060 R15: 0000000000000000 Mar 11 05:58:33 oneirogmophobie kernel: FS: 00007f0e4782b6f0(0000) GS:ffff880028240000(0000) knlGS:00000000f765c6c0 Mar 11 05:58:33 oneirogmophobie kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Mar 11 05:58:33 oneirogmophobie kernel: CR2: 0000000000590b30 CR3: 0000000424089000 CR4: 00000000000006e0 Mar 11 05:58:33 oneirogmophobie kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 11 05:58:33 oneirogmophobie kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Mar 11 05:58:33 oneirogmophobie kernel: Process xfssyncd (pid: 2923, threadinfo ffff8804248da000, task ffff8804250d3910) Mar 11 05:58:33 oneirogmophobie kernel: Stack: Mar 11 05:58:33 oneirogmophobie kernel: ffff8804248dbd80 ffffffff81304b56 ffff8804267b8060 0000000000000001 Mar 11 05:58:33 oneirogmophobie kernel: <0> ffff8804267b80a8 ffff8804248dbdbc ffff8804248dbdf0 ffffffff81305760 Mar 11 05:58:33 oneirogmophobie kernel: <0> 00000000248dbe10 0000000200000000 ffffffff81304b00 ffff88082508a800 Mar 11 05:58:33 oneirogmophobie kernel: Call Trace: Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81304b56>] xfs_reclaim_inode+0x56/0x120 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81305760>] xfs_inode_ag_walk+0x90/0x120 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81304b00>] ? xfs_reclaim_inode+0x0/0x120 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81304b00>] ? xfs_reclaim_inode+0x0/0x120 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81305863>] xfs_inode_ag_iterator+0x73/0xb0 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff813058ba>] xfs_reclaim_inodes+0x1a/0x20 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff813058ee>] xfs_sync_worker+0x2e/0x70 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81304f29>] xfssyncd+0x169/0x210 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81304dc0>] ? xfssyncd+0x0/0x210 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff8105469e>] kthread+0x8e/0xa0 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81003214>] kernel_thread_helper+0x4/0x10 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81054610>] ? kthread+0x0/0xa0 Mar 11 05:58:33 oneirogmophobie kernel: [<ffffffff81003210>] ? kernel_thread_helper+0x0/0x10 Mar 11 05:58:33 oneirogmophobie kernel: Code: 81 c7 44 24 08 01 00 00 00 e8 d3 24 04 00 c9 c3 90 55 89 d1 31 c0 48 89 f2 48 89 fe 48 c7 c7 38 29 ae 81 48 89 e5 e8 e1 a2 52 00 <0f> 0b eb fe 66 90 55 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 75 Mar 11 05:58:33 oneirogmophobie kernel: RIP [<ffffffff81305f6a>] assfail+0x1a/0x20 Mar 11 05:58:33 oneirogmophobie kernel: RSP <ffff8804248dbd50> Mar 11 05:58:33 oneirogmophobie kernel: ---[ end trace 867387936839e480 ]--- Mar 11 05:58:33 oneirogmophobie kernel: note: xfssyncd[2923] exited with preempt_count 2 what other information should i supply to help fixing this problem. please help 8) regards, Marius Tolzmann
Created attachment 25474 [details] Fix locking for inode cache radix tree tag updates This issue should be fixed by the attached patch from 2.6.34-rc1.
Marius, does the patch work for you?
I compiled the latest kernel (2.6.34-rc1) that included this patch and rebootet the system. Since we don't know how to force a crash we can't say if it is running fine now or not. System uptime is 4d23h without any problems. In the past the systems used to crash every 3 to 14+ days. The patched kernel is currently running on two other systems which used to crash sometimes. So we are currently testing the patch on 3 systems without problems (yet). Do you have any ideas what we can do to produce workloads that may trigger the bug when stress-testing the filesystem? cheers, marius
Hi there, today we found the following messages on one of our servers running the 2.6.34-rc1 kernel: Mar 18 16:54:17 bugblatterbeastoftraal kernel: Filesystem "sde": inode 0xb1e4b1 background reclaim flush failed with 11 Mar 18 18:23:46 bugblatterbeastoftraal kernel: Filesystem "sde": inode 0xb1e85f background reclaim flush failed with 11 Mar 23 18:48:59 bugblatterbeastoftraal kernel: Filesystem "sde": inode 0x82cc59 background reclaim flush failed with 11 the server did not crash this time. the problem was detected, because we have NFS Stale Filehandles on a nfs-client when accessing this exported disk. root@bugblatterbeastoftraal:/# xfs_info /dev/sde meta-data=/dev/sde isize=256 agcount=13, agsize=268435392 blks = sectsz=512 attr=2 data = bsize=4096 blocks=3417931776, imaxpct=5 = sunit=64 swidth=896 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal bsize=4096 blocks=521728, version=2 = sectsz=512 sunit=64 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 what information will be helpful? regards, marius..
"background reclaim flush failed with 11" is not a problem - it's an EAGAIN error and the background reclaim code will retry it properly on the next scan. The warning is there to catch fatal errors, so we need to make sure that it doesn't trigger on EAGAIN... Cheers, Dave.
The original fix is now in mainline, so closing.