Bug 203943 - ext4 corruption after RAID6 degraded; e2fsck skips block checks and fails
Summary: ext4 corruption after RAID6 degraded; e2fsck skips block checks and fails
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: Intel Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-21 06:51 UTC by Yann Ormanns
Modified: 2019-06-22 08:06 UTC (History)
2 users

See Also:
Kernel Version: 4.19.52-gentoo
Subsystem:
Regression: No
Bisected commit-id:



Description Yann Ormanns 2019-06-21 06:51:27 UTC
I use a 24TB SW-RAID6 array with ten 3TB HDDs. The array contains a dm-crypt container, which in turn contains an ext4 FS [1].
One of the disks developed errors and was kicked out of the array. Before I could replace it, the ext4 FS began throwing errors like these:

EXT4-fs error (device dm-1): ext4_find_dest_de:1802: inode #3833864: block 61343924: comm nfsd: bad entry in directory: rec_len % 4 != 0 - offset=1000, inode=2620025549, rec_len=30675, name_len=223, size=4096
EXT4-fs error (device dm-1): ext4_lookup:1577: inode #172824586: comm tvh:tasklet: iget: bad extra_isize 13022 (inode size 256)
EXT4-fs error (device dm-1): htree_dirblock_to_tree:1010: inode #7372807: block 117967811: comm tar: bad entry in directory: rec_len % 4 != 0 - offset=104440, inode=1855122647, rec_len=12017, name_len=209, size=4096 

I then used e2fsck to check the FS for errors, but it only produced dozens of lines like the following:
German original: "Block %$b von Inode %$i steht in Konflikt mit kritischen Metadaten, Blockprüfungen werden übersprungen."
Translated: "Inode %$i block %$b conflicts with critical metadata, skipping block checks." 
It also filled up my RAM, so I added ~100G of swap space. After scanning for a few days, the e2fsck process died.

Afterwards I was able to replace the disk and re-scan the FS with e2fsck; after 2 hours, the ext4 FS was clean again.

To understand the problem, I marked one of the disks as faulty, and the ext4 errors immediately reappeared. I added a spare disk and resynced the array, but e2fsck is unable to fix the errors; it only throws "Inode %$i block %$b conflicts with critical metadata, skipping block checks." while using more and more RAM.

Used software versions:
Kernel 4.19.52-gentoo
e2fsprogs-1.44.5
mdadm-4.1

Please let me know if I should provide additional information.

Kind regards,
Yann

########################################## additional information

#e2fsck -c /dev/mapper/share 
e2fsck 1.44.5 (15-Dec-2018)
badblocks: invalid last block - 5860269055
/dev/mapper/share: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Inode %$i block %$b conflicts with critical metadata, skipping block checks.
[...]
Inode %$i block %$b conflicts with critical metadata, skipping block checks.
Signal (6) SIGABRT si_code=SI_TKILL 
e2fsck(+0x33469)[0x55ec4942a469]
/lib64/libc.so.6(+0x39770)[0x7f562449f770]
/lib64/libc.so.6(gsignal+0x10b)[0x7f562449f6ab]
/lib64/libc.so.6(abort+0x123)[0x7f5624488539]
/lib64/libext2fs.so.2(+0x18af5)[0x7f5624ac3af5]
e2fsck(+0x18c30)[0x55ec4940fc30]
/lib64/libext2fs.so.2(+0x1a1ec)[0x7f5624ac51ec]
/lib64/libext2fs.so.2(+0x1a589)[0x7f5624ac5589]
/lib64/libext2fs.so.2(+0x1b04b)[0x7f5624ac604b]
e2fsck(+0x1978c)[0x55ec4941078c]
e2fsck(+0x1b0f6)[0x55ec494120f6]
e2fsck(+0x1b1dc)[0x55ec494121dc]
/lib64/libext2fs.so.2(ext2fs_get_next_inode_full+0x8e)[0x7f5624adc53e]
e2fsck(e2fsck_pass1+0xa12)[0x55ec49412d22]
e2fsck(e2fsck_run+0x6a)[0x55ec4940b54a]
e2fsck(main+0xefd)[0x55ec4940703d]
/lib64/libc.so.6(__libc_start_main+0xeb)[0x7f5624489e6b]
e2fsck(_start+0x2a)[0x55ec494092ea]

---------------------------------------------------------------------

#tune2fs -l /dev/mapper/share
tune2fs 1.44.5 (15-Dec-2018)
Filesystem volume name:   <none>
Last mounted on:          /home/share
Filesystem UUID:          c5f0559d-e3bd-473f-abc0-7c42b3115897
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              366268416
Block count:              5860269056
Reserved block count:     0
Free blocks:              755383351
Free inodes:              363816793
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         2048
Inode blocks per group:   128
RAID stride:              128
RAID stripe width:        1024
Flex block group size:    16
Filesystem created:       Sat Mar 17 14:36:16 2018
Last mount time:          Fri Jun 21 05:25:34 2019
Last write time:          Fri Jun 21 05:30:27 2019
Mount count:              3
Maximum mount count:      -1
Last checked:             Thu Jun 20 08:55:17 2019
Check interval:           0 (<none>)
Lifetime writes:          139 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      4c37872e-3207-4ff4-8939-a428feaeb49f
Journal backup:           inode blocks
FS Error count:           20776
First error time:         Thu Jun 20 14:18:47 2019
First error function:     ext4_lookup
First error line #:       1577
First error inode #:      172824586
First error block #:      0
Last error time:          Fri Jun 21 05:53:24 2019
Last error function:      ext4_lookup
Last error line #:        1577
Last error inode #:       172824586
Last error block #:       0

---------------------------------------------------------------------

#cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty] 
md2 : active raid6 sde[0] sdc[9] sdb[8] sdj[7] sdi[6] sda[11] sdd[10] sdh[3] sdg[2] sdf[1]
      23441080320 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      bitmap: 0/22 pages [0KB], 65536KB chunk

md1 : active raid1 sdk2[0] sdl2[2]
      52396032 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sdk1[0] sdl1[2]
      511680 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
Comment 1 Andreas Dilger 2019-06-21 07:15:59 UTC
This seems like a RAID problem and not an ext4 problem. The RAID array shouldn't be returning random garbage if one of the drives is unavailable. Maybe it is not doing data parity verification on reads, so that it is blindly returning bad blocks from the failed drive rather than reconstructing valid data from parity if the drive does not fail completely?
Comment 2 Theodore Tso 2019-06-21 12:40:55 UTC
Did you resync the disks *before* you ran e2fsck?   Or only afterwards?
Comment 3 Yann Ormanns 2019-06-21 14:52:01 UTC
Andreas & Ted, thank you for your replies.

(In reply to Andreas Dilger from comment #1)
> This seems like a RAID problem and not an ext4 problem. The RAID array
> shouldn't be returning random garbage if one of the drives is unavailable.
> Maybe it is not doing data parity verification on reads, so that it is
> blindly returning bad blocks from the failed drive rather than
> reconstructing valid data from parity if the drive does not fail completely?

How can I check that? At least, running "checkarray" did not turn up anything new or helpful.
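For what it's worth, md's parity scrub (which checkarray wraps) can also be driven directly through sysfs. A minimal sketch, assuming the array name md2 from the /proc/mdstat output above and root privileges:

```shell
# Hypothetical helper wrapping md's sysfs scrub interface (needs root).
md_scrub() {
    md="$1"                                       # e.g. md2
    echo check > "/sys/block/$md/md/sync_action"  # start a read-only scrub
    # once the scrub finishes, a nonzero count means parity mismatches:
    cat "/sys/block/$md/md/mismatch_cnt"
}
```

Note that md only verifies parity during such a scrub; on normal reads it does not cross-check data against parity unless a member read fails outright.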

(In reply to Theodore Tso from comment #2)
> Did you resync the disks *before* you ran e2fsck?   Or only afterwards?

1. my RAID6 got degraded and ext4 errors showed up
2. I ran e2fsck; it consumed all memory and showed only "Inode %$i block %$b conflicts with critical metadata, skipping block checks."
3. I replaced the faulty disk and resynced the RAID6
4. e2fsck was able to clean the filesystem
5. I simulated a drive fault (so my RAID6 had n+1 working disks left)
6. the ext4 FS got corrupted again
7. although the RAID is clean again, e2fsck is not able to clean the FS (as in step 2)
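For reference, the fault simulation in steps 5-7 can be sketched with mdadm. This is a hypothetical reconstruction, not taken verbatim from the report; /dev/md2 and /dev/sde are assumed from the /proc/mdstat output above, and the commands need root:

```shell
# Hypothetical reproduction of the drive-fault simulation (destructive!).
simulate_disk_fault() {
    mdadm /dev/md2 --fail /dev/sde      # step 5: mark a member faulty
    mdadm /dev/md2 --remove /dev/sde    # drop it from the array
    # ... exercise the now-degraded filesystem here ...
    mdadm /dev/md2 --add /dev/sde       # re-add the member; md resyncs it
    cat /proc/mdstat                    # watch the resync progress
}
```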
Comment 4 Theodore Tso 2019-06-21 18:29:07 UTC
That sounds *very* clearly like a RAID bug. If RAID6 is returning garbage to the file system in degraded mode, there is nothing the file system can do.

What worries me is that if the RAID6 system was returning garbage when *reading*, who knows how it was trashing the file system image when the ext4 kernel code was *writing* to it?

In any case, there's very little we as ext4 developers can do here to help, except give you some advice for how to recover your file system.

What I'd suggest you do is use the debugfs tool to sanity-check the inode. If the inode number reported by e2fsck was 123456, you can look at it using the debugfs command "stat <123456>". If the timestamps, user and group ID numbers, etc. look insane, you can speed up the recovery time by using the command "clri <123456>", which zeroes out the inode.
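The stat/clri workflow can be tried safely on a throwaway image first. A minimal sketch, assuming e2fsprogs is installed; /tmp/ext4-demo.img is a hypothetical path, and inode 11 is lost+found on a freshly created filesystem:

```shell
PATH="$PATH:/sbin:/usr/sbin"              # mkfs.ext4 often lives in sbin
truncate -s 16M /tmp/ext4-demo.img        # sparse test image, no root needed
mkfs.ext4 -q -F /tmp/ext4-demo.img        # -F: allow a regular file as target
# Inspect an inode by number -- note the angle brackets:
debugfs -R 'stat <11>' /tmp/ext4-demo.img
# Zero out a suspect inode (destructive; -w opens the FS read-write):
debugfs -w -R 'clri <12>' /tmp/ext4-demo.img
```

On the real system the image path would be replaced by /dev/mapper/share, with the filesystem unmounted.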
Comment 5 Yann Ormanns 2019-06-21 19:34:36 UTC
Thank you for your support, Ted. About one week before the RAID got degraded, I created a full file-based backup, so I can at least expect minimal data loss - I'd just like to save the time it would take to copy ~22TB back :-)

Before filing a possible bug in the correct section for RAID, I'd like to understand the steps you described.
In fact, e2fsck does not report inode numbers, only placeholders ("%$i"; likewise "%$b" for blocks).
Using the total inode count in debugfs leads to an error:
share ~ # tune2fs -l /dev/mapper/share | grep "Inode count"
Inode count:              366268416
share ~ # debugfs /dev/mapper/share 
debugfs 1.44.5 (15-Dec-2018)
debugfs:  stat 366268416
366268416: File not found by ext2_lookup

Or did I get you wrong?
Comment 6 Theodore Tso 2019-06-21 20:47:31 UTC
That's because the German translation is busted.   I've complained to the German maintainer of the e2fsprogs messages file at the Translation Project, but it hasn't been fixed yet.   See [1] for more details.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=892173#10

It *should* have actually printed the inode number, and it does if you just use the built-in English text. In general, I *really* wish people would disable translations when reporting bugs, because very often the translations add noise that makes life harder for developers, and in this case it was losing information for you, the system administrator.
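Disabling the translation is a one-variable change. A sketch on a throwaway image, assuming e2fsprogs is installed; on the reporter's system the device would be /dev/mapper/share (unmounted) instead:

```shell
PATH="$PATH:/sbin:/usr/sbin"              # e2fsprogs tools often live in sbin
truncate -s 16M /tmp/fsck-demo.img        # sparse test image, no root needed
mkfs.ext4 -q -F /tmp/fsck-demo.img
# LC_ALL=C forces the untranslated English messages, with real
# inode/block numbers instead of unexpanded placeholders:
LC_ALL=C e2fsck -fn /tmp/fsck-demo.img    # -f: force check, -n: read-only
```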

See the debugfs man page for more details, but to specify inode numbers to debugfs, you need to surround the inode number with angle brackets.  Here are some examples:

% debugfs /tmp/test.img
debugfs 1.45.2 (27-May-2019)
debugfs:  ls a
 12  (12) .    2  (12) ..    13  (988) motd   
debugfs:  stat /a/motd
Inode: 13   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 286
File ACL: 0
Links: 1   Blockcount: 2
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5d0d416e -- Fri Jun 21 16:43:26 2019
atime: 0x5d0d416e -- Fri Jun 21 16:43:26 2019
mtime: 0x5d0d416e -- Fri Jun 21 16:43:26 2019
Inode checksum: 0x0000857d
EXTENTS:
(0):20
debugfs:  stat <13>
Inode: 13   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 286
File ACL: 0
Links: 1   Blockcount: 2
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5d0d416e -- Fri Jun 21 16:43:26 2019
atime: 0x5d0d416e -- Fri Jun 21 16:43:26 2019
mtime: 0x5d0d416e -- Fri Jun 21 16:43:26 2019
Inode checksum: 0x0000857d
EXTENTS:
(0):20
debugfs:  ncheck 13
Inode	Pathname
13	/a/motd 
debugfs:  stat a
Inode: 12   Type: directory    Mode:  0755   Flags: 0x80000
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 1024
File ACL: 0
Links: 2   Blockcount: 2
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5d0d416b -- Fri Jun 21 16:43:23 2019
atime: 0x5d0d416b -- Fri Jun 21 16:43:23 2019
mtime: 0x5d0d416b -- Fri Jun 21 16:43:23 2019
Inode checksum: 0x000042d4
EXTENTS:
(0):18
debugfs:  block_dump 18
0000  0c00 0000 0c00 0102 2e00 0000 0200 0000  ................
0020  0c00 0202 2e2e 0000 0d00 0000 dc03 0401  ................
0040  6d6f 7464 0000 0000 0000 0000 0000 0000  motd............
0060  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
1760  0000 0000 0000 0000 0c00 00de be4e 16e9  .............N..

debugfs:
Comment 7 Yann Ormanns 2019-06-22 07:48:47 UTC
I changed the locale to en_GB and now got the actual inode numbers - 5 or 6 inodes, which were what made e2fsck so extremely slow. I zeroed them with debugfs, and afterwards e2fsck was able to fix many, many errors.
lost+found contains 934G of data (53190 entries). Over the next few days, I will try to examine them.

While copying files back from my backup, I re-tested a disk failure and removed a disk from the array. After a while I re-added it, and now the array is resyncing. I will watch the logs for possible FS errors - for now, everything seems clean.

Thanks a lot for your support!
Comment 8 Yann Ormanns 2019-06-22 08:06:09 UTC
It seems as if the data corruption only affects the directories and files that the system wrote to while the ext4 FS was not clean.
A quick rsync from my backup fileserver copied none of the static data back to the productive system - so I assume that these files should be okay (but of course, I will at least spot-check that).

To sum it up: the data loss affects my local backup files on this system and my TV recordings - 934G of total data is just about right for that.
