Bug 15516 - Assertion failed: __xfs_iflags_test(ip, XFS_IRECLAIMABLE)
Summary: Assertion failed: __xfs_iflags_test(ip, XFS_IRECLAIMABLE)
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: XFS (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: Christoph Hellwig
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-11 11:05 UTC by Marius Tolzmann
Modified: 2010-04-03 10:31 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.33
Tree: Mainline
Regression: No


Attachments
Fix locking for inode cache radix tree tag updates (2.67 KB, patch)
2010-03-11 13:08 UTC, Christoph Hellwig
Details | Diff

Description Marius Tolzmann 2010-03-11 11:05:52 UTC
Hello..

we have some problems with our Fileservers currently running kernel 2.6.33 ..

Background-Information: 
  We are using a 3ware Raid-Controller 9690SA-8E which has 1-6 Xtore-Enclosures 
  attached with 16 x 1 TB Seagate disks (RAID6) each.

  The 13TB xfs-mountpoints are exported via NFSv3.

  We are running this kind of hardware-configuration since 3 years now without 
  any xfs related issues.

  We have many servers crashing with the same problem.


  xfs_check/xfs_repair don't report any errors, the RAID6 is ok and no disk 
  reports any smart-related errors.

example xfs_info output for one 13 TB filesystem:

meta-data=/dev/sdc               isize=256    agcount=13, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3417931776, imaxpct=5
         =                       sunit=64     swidth=896 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The servers stopped running with NumLock/ScrollLock LEDs blinking with kernel 2.6.32 showing:
 
   Kernel panic - not syncing: xfs_fs_destroy_inode: cannot reclaim 
   0xffff88069b165940


Here is the dump of the "kernel BUG" of our 2.6.33 kernel (error message as found in syslog):

Mar 11 05:58:33 oneirogmophobie kernel: Assertion failed: __xfs_iflags_test(ip, XFS_IRECLAIMABLE), file: fs/xfs/linux-2.6/xfs_sync.c, line: 727
Mar 11 05:58:33 oneirogmophobie kernel: ------------[ cut here ]------------
Mar 11 05:58:33 oneirogmophobie kernel: kernel BUG at fs/xfs/support/debug.c:109!
Mar 11 05:58:33 oneirogmophobie kernel: invalid opcode: 0000 [#1] PREEMPT SMP
Mar 11 05:58:33 oneirogmophobie kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:0f.0/0000:07:00.0/host6/scsi_host/host6/stats
Mar 11 05:58:33 oneirogmophobie kernel: CPU 1
Mar 11 05:58:33 oneirogmophobie kernel: Pid: 2923, comm: xfssyncd Not tainted 2.6.33.x86_64.42 #2 S2932/S2932-E/S2932-SI/empty
Mar 11 05:58:33 oneirogmophobie kernel: RIP: 0010:[<ffffffff81305f6a>]  [<ffffffff81305f6a>] assfail+0x1a/0x20
Mar 11 05:58:33 oneirogmophobie kernel: RSP: 0018:ffff8804248dbd50  EFLAGS: 00010286
Mar 11 05:58:33 oneirogmophobie kernel: RAX: 000000000000006b RBX: ffff88027c994840 RCX: 000000000001ffff
Mar 11 05:58:33 oneirogmophobie kernel: RDX: ffff880028240000 RSI: 0000000000000046 RDI: ffffffff81d8a414
Mar 11 05:58:33 oneirogmophobie kernel: RBP: ffff8804248dbd50 R08: 0000000000009bab R09: 0000000000000000
Mar 11 05:58:33 oneirogmophobie kernel: R10: 0000000000000000 R11: ffff8806666d5b08 R12: 0000000000000002
Mar 11 05:58:33 oneirogmophobie kernel: R13: ffff88027c994938 R14: ffff8804267b8060 R15: 0000000000000000
Mar 11 05:58:33 oneirogmophobie kernel: FS:  00007f0e4782b6f0(0000) GS:ffff880028240000(0000) knlGS:00000000f765c6c0
Mar 11 05:58:33 oneirogmophobie kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 11 05:58:33 oneirogmophobie kernel: CR2: 0000000000590b30 CR3: 0000000424089000 CR4: 00000000000006e0
Mar 11 05:58:33 oneirogmophobie kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 11 05:58:33 oneirogmophobie kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 11 05:58:33 oneirogmophobie kernel: Process xfssyncd (pid: 2923, threadinfo ffff8804248da000, task ffff8804250d3910)
Mar 11 05:58:33 oneirogmophobie kernel: Stack:
Mar 11 05:58:33 oneirogmophobie kernel:  ffff8804248dbd80 ffffffff81304b56 ffff8804267b8060 0000000000000001
Mar 11 05:58:33 oneirogmophobie kernel: <0> ffff8804267b80a8 ffff8804248dbdbc ffff8804248dbdf0 ffffffff81305760
Mar 11 05:58:33 oneirogmophobie kernel: <0> 00000000248dbe10 0000000200000000 ffffffff81304b00 ffff88082508a800
Mar 11 05:58:33 oneirogmophobie kernel: Call Trace:
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81304b56>] xfs_reclaim_inode+0x56/0x120
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81305760>] xfs_inode_ag_walk+0x90/0x120
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81304b00>] ? xfs_reclaim_inode+0x0/0x120
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81304b00>] ? xfs_reclaim_inode+0x0/0x120
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81305863>] xfs_inode_ag_iterator+0x73/0xb0
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff813058ba>] xfs_reclaim_inodes+0x1a/0x20
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff813058ee>] xfs_sync_worker+0x2e/0x70
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81304f29>] xfssyncd+0x169/0x210
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81304dc0>] ? xfssyncd+0x0/0x210
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff8105469e>] kthread+0x8e/0xa0
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81003214>] kernel_thread_helper+0x4/0x10
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81054610>] ? kthread+0x0/0xa0
Mar 11 05:58:33 oneirogmophobie kernel:  [<ffffffff81003210>] ? kernel_thread_helper+0x0/0x10
Mar 11 05:58:33 oneirogmophobie kernel: Code: 81 c7 44 24 08 01 00 00 00 e8 d3 24 04 00 c9 c3 90 55 89 d1 31 c0 48 89 f2 48 89 fe 48 c7 c7 38 29 ae 81 48
89 e5 e8 e1 a2 52 00 <0f> 0b eb fe 66 90 55 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 75
Mar 11 05:58:33 oneirogmophobie kernel: RIP  [<ffffffff81305f6a>] assfail+0x1a/0x20
Mar 11 05:58:33 oneirogmophobie kernel:  RSP <ffff8804248dbd50>
Mar 11 05:58:33 oneirogmophobie kernel: ---[ end trace 867387936839e480 ]---
Mar 11 05:58:33 oneirogmophobie kernel: note: xfssyncd[2923] exited with preempt_count 2

what other information should i supply to help fixing this problem.

please help 8)

regards,

   Marius Tolzmann
Comment 1 Christoph Hellwig 2010-03-11 13:08:41 UTC
Created attachment 25474 [details]
Fix locking for inode cache radix tree tag updates

This issue should be fixed by the attached patch from 2.6.34-rc1.
Comment 2 Christoph Hellwig 2010-03-16 09:54:44 UTC
Marius, does the patch work for you?
Comment 3 Marius Tolzmann 2010-03-16 13:57:06 UTC
I compiled the latest kernel (2.6.34-rc1) that included this patch and rebootet the system. 

Since we don't know how to force a crash we can't say if it is running fine now or not. 

System uptime is 4d23h without any problems. 

In the past the systems used to crash every 3 to 14+ days.

The patched kernel is currently running on two other systems which used to crash sometimes.

So we are currently testing the patch on 3 systems without problems (yet).

Do you have any ideas what we can do to produce workloads that may trigger the bug when stress-testing the filesystem?

cheers, marius
Comment 4 Marius Tolzmann 2010-03-25 13:30:40 UTC
Hi there,

today we found the following messages on one of our servers running the 2.6.34-rc1 kernel:

Mar 18 16:54:17 bugblatterbeastoftraal kernel: Filesystem "sde": inode 0xb1e4b1 background reclaim flush failed with 11
Mar 18 18:23:46 bugblatterbeastoftraal kernel: Filesystem "sde": inode 0xb1e85f background reclaim flush failed with 11
Mar 23 18:48:59 bugblatterbeastoftraal kernel: Filesystem "sde": inode 0x82cc59 background reclaim flush failed with 11

the server did not crash this time.

the problem was detected, because we have NFS Stale Filehandles on a nfs-client when accessing this exported disk.

root@bugblatterbeastoftraal:/# xfs_info /dev/sde
meta-data=/dev/sde               isize=256    agcount=13, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3417931776, imaxpct=5
         =                       sunit=64     swidth=896 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


what information will be helpful?

regards,

marius..
Comment 5 Dave Chinner 2010-03-25 21:18:43 UTC
"background reclaim flush failed with 11" is not a problem - it's an EAGAIN error and the background reclaim code will retry it properly on the next scan. The warning is there to catch fatal errors, so we need to make sure that it doesn't trigger on EAGAIN...

Cheers,

Dave.
Comment 6 Christoph Hellwig 2010-04-03 10:31:48 UTC
The original fix is now in mainline, so closing.

Note You need to log in before you can comment on or make changes to this bug.