Bug 10981 - XFS filesystem corruption when running out of space
Summary: XFS filesystem corruption when running out of space
Status: CLOSED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: All Linux
Importance: P1 normal
Assignee: XFS Guru
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-06-25 06:40 UTC by Török Edwin
Modified: 2012-05-22 12:40 UTC

See Also:
Kernel Version: 2.6.26-rc4-00168-gc3b25b3
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Török Edwin 2008-06-25 06:40:46 UTC
Latest working kernel version: Unknown
Earliest failing kernel version: 2.6.26-rc4-00168-gc3b25b3
Distribution: Debian sid
Hardware Environment: 
    Dell Inspiron 6400, CPU Intel Core Duo T2300 @ 1.66GHz, 1G RAM, HDD 60G 5400 rpm Seagate ST96812AS, ICH7 family chipset, 945GM graphics
Software Environment:
   Linux 2.6.26-rc4-00168-gc3b25b3
   Distribution: Debian sid(unstable)
   Kernel built with gcc 4.2
   xfsprogs: latest from Debian (2.9.8-1)
Filesystem information:
   2 partitions: / and /var, both XFS
   Filesystem was created with lazy-count enabled
Problem Description:
  I have run out of space on /, which is using an XFS filesystem. Now the filesystem is corrupt, and unmountable.
 
  I was running the following commands in a gnome terminal:
     dd if=/dev/zero of=xt bs=100M count=9&
     dd if=/dev/zero of=yt bs=100M count=9&
     rm xt; rm yt
    I have run out of space on / (got a message from gnome that it is more than 99% full). 
    After that point I couldn't run any commands (not even cat and df), it said no such command. I guess the root filesystem got unmounted automatically.
    I switched to the console, but it was garbled, and switching back to X was impossible too. I have rebooted, and then I got a failure when mounting / filesystem (see below).
   This is with a 2.6.26-rc4 kernel (and since I cannot boot I can't try building a newer kernel). I also tried booting a 2.6.25-2 distro kernel, and I got a similar error message.
   I've got Fedora 8 installed on another partition, and I could attempt recovery from there. 
   Before doing that is there any information you would need to analyze this problem?
  
Running 'xfs_repair /dev/sda6' from Fedora shows:
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

 So should I go ahead and destroy the log, or is there anything in there that you need to diagnose the problem?


XFS: correcting sb_features alignment problem
XFS mounting filesystem sda6
Starting XFS recovery on filesystem: sda6 (logdev: internal)
00000000: 58 41 47 46 00 00 00 01 00 00 00 04 00 02 54 29  XAGF..........T
Filesystem "sda6": XFS internal error xfs_alloc_read_agf at line 2194 of file
/var/local/src/linux-2.6/fs/xfs/xfs_alloc.c. Caller 0xc023b75a
Pid: 2273, comm: mount Not tainted 2.6.26-rc4-00168-gc3b25b3 #26
 [<c02615c7>] xfs_error_report+0x4e/0x50
 [<c023b75a>] ? xfs_alloc_pagf_init+0x1e/0x3b
 [<c026160c>] xfs_corruption_error+0x43/0x4b
 [<c023b75a>] ? xfs_alloc_pagf_init+0x1e/0x3b
 [<c023b63c>] xfs_alloc_read_agf+0xbb/0x1bb
 [<c023b75a>] ? xfs_alloc_pagf_init+0x1e/0x3b
 [<c023b75a>] xfs_alloc_pagf_init+0x1e/0x3b
 [<c027bf88>] xfs_initialize_perag_data+0xce/0x16a
 [<c027c52a>] xfs_mountfs+0x487/0x69c
 [<c02b08a2>] ? _atomic_dec_and_lock+0x46/0x64
 [<c0288f3c>] ? kmem_zalloc+0xc/0x30
 [<c027d144>] ? xfs_mru_cache_create+0xdc/0x107
 [<c0283287>] xfs_mount+0x2f9/0x342
 [<c029252a>] xfs_fs_fill_super+0xa8/0x1eb
 [<c0283287>] xfs_mount+0x2f9/0x342
 [<c029252a>] xfs_fs_fill_super+0xa9/0x1eb
 [<c017f7f6>] get_sb_bdev+0xea/0x114
 [<c02b139f>] ? idr_pre_get+0x1a/0x44
 [<c0291382>] xfs_fs_get_sb+0x21/0x27
 [<c0292482>] ? xfs_fs_fill_super+0x0/0x1eb
 [<c017f465>] vfs_kern_mount+0x59/0x117
 [<c017f56d>] do_kern_mount+0x33/0xbd
 [<c0194446>] do_new_mount+0x59/0x77
 [<c0195238>] do_mount+0x1ce/0x1e4
 [<c0427bfa>] ? error_code+0x72/0x78
 [<c015007b>] ? acct_file_reopen+0x2/0xf8
 [<c042b7a4>] ? iret_exc+0x418/0x980
 [<c01952ce>] sys_mount+0x80/0xb2
 [<c0103c72>] syscall_call+0x7/0xb
 [<c0420000>] ? detect_ht+0x7e/0x13b
mount: Structure needs cleaning
Begin: Running /scripts/local-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
Done.
mount: No such file or directory
mount: No such file or directory
Target filesystem doesn't have /sbin/init.
No init found. Try passing init= bootarg.


BusyBox v1.9.2 (Debian 1:1.9.2-3) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off
(initramfs)

The error message from the 2.6.25-2 distro kernel is similar:
Filesystem "sda6": XFS internal error xfs_alloc_read_agf at line 2195 of file fs/xfs/xfs_alloc.c. Caller 0xf8a01801
 [<...>] xfs_alloc_read_agf+0x129/0x1a6 [xfs]
 [<...>] xfs_alloc_pagf_init+0x15/0x31 [xfs]
 [<...>] xfs_alloc_pagf_init+0x15/0x31 [xfs]
 [<...>] xfs_alloc_pagf_init+0x15/0x31 [xfs]
 [<...>] xfs_ialloc_pagi_init+0x2d/0x33 [xfs]
 [<...>] xfs_initialize_perag_data+0x69/0x140 [xfs]
 [<...>] xfs_mountfs+0x34a/0x5e3 [xfs]
 [<...>] kmem_alloc+0x53/0xa8 [xfs]
 [<...>] default_wake_function+0x0/0x8 
  .......



Steps to reproduce:
     [These are the steps I took when the problem occurred; since the system doesn't boot,
I can't check whether the same sequence reproduces it.]
     Have a / filesystem with XFS, with ~800M free space.
     Run these commands:
     dd if=/dev/zero of=xt bs=100M count=9&
     dd if=/dev/zero of=yt bs=100M count=9&
     rm xt; rm yt
     Wait till filesystem is full
     Root filesystem is inaccessible/unmountable

If you need additional information, please ask.
Comment 1 Lachlan McIlroy 2008-06-25 20:34:57 UTC
Mount has encountered a corrupt AGF.  You will have to destroy the log and run
repair so this filesystem can be mounted again.  Fortunately the panic occurred
after the log was replayed so the damage should be minimised.  But before you do
that could you do the following?

# xfs_db /dev/sda6
xfs_db> agf 0
xfs_db> print
magicnum = 0x58414746
versionnum = 1
...
xfs_db> agf 1
xfs_db> print
magicnum = 0x58414746
versionnum = 1
...

and keep doing that for each AG in the filesystem and post the output.
There shouldn't be too many AGs (probably about 4).  And if possible run
xfs_metadump on this filesystem.
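
The per-AG inspection requested above can also be done non-interactively. The following is only a sketch: it assumes xfs_db's `-r` (read-only) and `-c` (run command) options, and the helper name `agf_dump_command` and the AG count passed to it are illustrative, not something taken from this report.

```python
import shlex


def agf_dump_command(device, agcount):
    """Build a single read-only xfs_db invocation that prints every AGF header.

    `agcount` is an assumption supplied by the caller; the real number of
    allocation groups has to come from the filesystem itself.
    """
    argv = ["xfs_db", "-r"]  # -r: open the device read-only
    for agno in range(agcount):
        # Each AG needs two commands: select its AGF, then print it.
        argv += ["-c", f"agf {agno}", "-c", "print"]
    argv.append(device)
    return argv


if __name__ == "__main__":
    # Only prints the command line; nothing is run against a device here.
    print(shlex.join(agf_dump_command("/dev/sda6", 4)))
```

The function only builds the argument list, so it can be reviewed before anything touches the damaged filesystem.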
Comment 2 Török Edwin 2008-06-26 00:18:21 UTC
(In reply to comment #1)
> Mount has encountered a corrupt AGF.  You will have to destroy the log and run
> repair so this filesystem can be mounted again.  Fortunately the panic occurred
> after the log was replayed so the damage should be minimised.  But before you do
> that could you do the following?
> 

I had 16 AGs (0 through 15); here is the output:
# xfs_db /dev/sda6
xfs_db: cannot init perag data (117)
xfs_db> agf 0
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 0
length = 152617
bnoroot = 21179
cntroot = 2977
bnolevel = 2
cntlevel = 2
flfirst = 13
fllast = 18
flcount = 6
freeblks = 1809
longest = 10
btreeblks = 4
xfs_db> agf 1
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 1
length = 152617
bnoroot = 170
cntroot = 648
bnolevel = 1
cntlevel = 1
flfirst = 126
fllast = 1
flcount = 4
freeblks = 1360
longest = 10
btreeblks = 0
xfs_db> agf 2
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 2
length = 152617
bnoroot = 4350
cntroot = 4490
bnolevel = 1
cntlevel = 1
flfirst = 108
fllast = 111
flcount = 4
freeblks = 1410
longest = 10
btreeblks = 0
xfs_db> agf 3
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 3
length = 152617
bnoroot = 915
cntroot = 1677
bnolevel = 1
cntlevel = 1
flfirst = 118
fllast = 121
flcount = 4
freeblks = 1640
longest = 10
btreeblks = 0
xfs_db> agf 4
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 4
length = 152617
bnoroot = 2582
cntroot = 3055
bnolevel = 2
cntlevel = 2
flfirst = 124
fllast = 1
flcount = 6
freeblks = 4294967292
longest = 11
btreeblks = 4
xfs_db> agf 5
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 5
length = 152617
bnoroot = 169
cntroot = 180
bnolevel = 1
cntlevel = 1
flfirst = 85
fllast = 88
flcount = 4
freeblks = 860
longest = 10
btreeblks = 0
xfs_db> agf 6
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 6
length = 152617
bnoroot = 315
cntroot = 1213
bnolevel = 1
cntlevel = 1
flfirst = 60
fllast = 63
flcount = 4
freeblks = 998
longest = 11
btreeblks = 0
xfs_db> agf 7
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 7
length = 152617
bnoroot = 699
cntroot = 753
bnolevel = 1
cntlevel = 1
flfirst = 66
fllast = 69
flcount = 4
freeblks = 849
longest = 10
btreeblks = 0
xfs_db> agf 8
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 8
length = 152617
bnoroot = 12543
cntroot = 12545
bnolevel = 2
cntlevel = 2
flfirst = 34
fllast = 39
flcount = 6
freeblks = 1696
longest = 12
btreeblks = 4
xfs_db> agf 9
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 9
length = 152617
bnoroot = 18838
cntroot = 14324
bnolevel = 2
cntlevel = 2
flfirst = 124
fllast = 1
flcount = 6
freeblks = 4048
longest = 12
btreeblks = 4
xfs_db> agf 10
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 10
length = 152617
bnoroot = 2102
cntroot = 3239
bnolevel = 1
cntlevel = 1
flfirst = 7
fllast = 10
flcount = 4
freeblks = 1384
longest = 10
btreeblks = 0
xfs_db> agf 11
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 11
length = 152617
bnoroot = 465
cntroot = 1211
bnolevel = 1
cntlevel = 1
flfirst = 112
fllast = 115
flcount = 4
freeblks = 433
longest = 10
btreeblks = 0
xfs_db> agf 12
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 12
length = 152617
bnoroot = 139
cntroot = 265
bnolevel = 2
cntlevel = 2
flfirst = 55
fllast = 60
flcount = 6
freeblks = 3203
longest = 10
btreeblks = 5
xfs_db> agf 13
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 13
length = 152617
bnoroot = 29595
cntroot = 218
bnolevel = 2
cntlevel = 2
flfirst = 59
fllast = 64
flcount = 6
freeblks = 2018
longest = 10
btreeblks = 4
xfs_db> agf 14
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 14
length = 152617
bnoroot = 2813
cntroot = 4492
bnolevel = 1
cntlevel = 1
flfirst = 89
fllast = 92
flcount = 4
freeblks = 499
longest = 10
btreeblks = 0
xfs_db> agf 15
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 15
length = 152617
bnoroot = 76
cntroot = 145
bnolevel = 2
cntlevel = 2
flfirst = 40
fllast = 45
flcount = 6
freeblks = 2482
longest = 10
btreeblks = 4
xfs_db> agf 16
bad allocation group number 16
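
A dump this long is easier to scan mechanically. The sketch below parses `name = value` lines like the ones above and flags any AGF whose `freeblks` is not smaller than its `length`. The function names are hypothetical, the plausibility rule is just one simple check (a free-block count cannot exceed the size of the AG), and the sample is two of the records from this comment.

```python
import re

# Matches lines such as "freeblks = 1809" or "magicnum = 0x58414746".
FIELD = re.compile(r"^\s*(\w+) = (0x[0-9a-fA-F]+|\d+)\s*$")


def parse_agf_dump(text):
    """Return one dict of integer fields per AGF found in xfs_db 'print' output."""
    records = []
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue  # prompts, warnings and blank lines are skipped
        name, raw = match.groups()
        if name == "magicnum":
            records.append({})  # every AGF printout starts with magicnum
        if records:
            records[-1][name] = int(raw, 16) if raw.startswith("0x") else int(raw)
    return records


def implausible_agfs(records):
    """Sequence numbers of AGFs claiming at least as many free blocks as the AG holds."""
    return [r["seqno"] for r in records if r["freeblks"] >= r["length"]]


SAMPLE = """\
xfs_db> agf 3
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 3
length = 152617
freeblks = 1640
longest = 10
xfs_db> agf 4
xfs_db> print
magicnum = 0x58414746
versionnum = 1
seqno = 4
length = 152617
freeblks = 4294967292
longest = 11
"""

if __name__ == "__main__":
    print(implausible_agfs(parse_agf_dump(SAMPLE)))
```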

Here is the metadump of the filesystem (19M bzip2 compressed):
http://www.hotlinkfiles.com/files/1499548_on9su/metadump_1.bz2

I have run xfs_repair -L /dev/sda6, and mounted the filesystem. Here is xfs_info:
meta-data=/dev/sda6              isize=256    agcount=16, agsize=152617 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2441872, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096  
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0

Is there any other information you'd need?
Comment 3 Lachlan McIlroy 2008-06-26 00:45:09 UTC
AGF 4 has a bogus free block count of 4294967292 (should be less than 152617).
Thanks for the metadump, I'll pull it down and see what else I can dig out of it.
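
One way to read that number, offered as an interpretation rather than something confirmed in this thread: 4294967292 is 2^32 - 4, which is what an unsigned 32-bit counter holds after being decremented four blocks below zero. A small sketch of the arithmetic (the helper name `as_signed32` is illustrative):

```python
def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


AG_LENGTH = 152617       # "length" reported for every AG in the dump
FREEBLKS_AG4 = 4294967292  # "freeblks" reported for AGF 4

if __name__ == "__main__":
    # The reported count is far larger than the AG itself.
    print(FREEBLKS_AG4 >= AG_LENGTH)
    # Read as a signed value, it is a small negative number.
    print(as_signed32(FREEBLKS_AG4))
    # The same bit pattern results from subtracting 4 from 0 in 32 bits.
    print((0 - 4) & 0xFFFFFFFF == FREEBLKS_AG4)
```

If this reading is right, the counter was off by only a few blocks before it wrapped, which would fit an accounting error near ENOSPC rather than random damage to the header.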
Comment 4 Alan 2012-05-22 12:40:24 UTC
Closing as obsolete; please re-open if this is incorrect.
