Bug 200371

Summary: Unable to Mount… EXT4: First Meta block group too large
Product: File System
Reporter: mcolgin
Component: ext4
Assignee: fs_ext4 (fs_ext4)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: mcolgin, tytso
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.10.0-514.26.2.el7.x86_64
Subsystem:
Regression: No
Bisected commit-id:

Description mcolgin 2018-06-29 18:34:48 UTC
ERROR
=====
EXT4-fs (dm-46): first meta block group too large: 1152 (group descriptor block count 1096)

Environment
===========
I have an ext4 filesystem on a machine running kernel 4.17.2-1.el7.elrepo.x86_64. The logical volume is 8.56 TiB.

This filesystem has been resized many times since it was first created.

e2fsck
======
e2fsck 1.42.9 (28-Dec-2013)
/dev/vg_areca/lv_MYLVNAME: clean, 59060/574488576 files, 2294844696/2297954304 blocks


DESC
====
After the last "resize2fs", I started receiving this error via "dmesg". I was running kernel 3.10.0-514.26.2.el7.x86_64 and, after some googling, found that current versions of Linux had fixed a bug that produced this error. However, after updating the kernel, I am still getting the error.

My thought is that my multiple resizes have surfaced an issue, but I'm not sure how to address it.
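
For reference, the two numbers in the error message can be related to the filesystem size with a little arithmetic. This is only a sketch under assumptions not stated in this report: 4 KiB blocks, 32768 blocks per group, and 64-byte group descriptors (as with the 64bit feature). The function names are made up for illustration.

```shell
# Assumed geometry (not confirmed by the report): 4 KiB blocks,
# 32768 blocks per group, 64-byte group descriptors.
BLOCK_SIZE=4096
BLOCKS_PER_GROUP=32768
DESC_SIZE=64

# Number of block groups needed to cover a given block count (rounded up).
group_count() {
    echo $(( ($1 + BLOCKS_PER_GROUP - 1) / BLOCKS_PER_GROUP ))
}

# Number of blocks needed to hold one descriptor per group (rounded up).
gdb_count() {
    local groups per_block
    groups=$(group_count "$1")
    per_block=$(( BLOCK_SIZE / DESC_SIZE ))
    echo $(( (groups + per_block - 1) / per_block ))
}

blocks=2297954304   # block count reported by e2fsck above
echo "groups: $(group_count "$blocks")"
echo "group descriptor blocks: $(gdb_count "$blocks")"
```

Under these assumptions the result is 1096 descriptor blocks, which matches the "group descriptor block count" in the error. The first meta block group value of 1152 is larger than that, which is the condition the message complains about.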


LINKS
=====

Previous kernel fix for a similar error, though it appears to address only a symptom

https://www.novell.com/support/kb/doc.php?id=7018898


"debugfs" output (minus roughly 70k entries of the form "Group 70127: (Blocks 2297921536-2297954303) [INODE_UNINIT, ITABLE_ZEROED]")

https://pastebin.com/EN1xyBAM
Comment 1 Theodore Tso 2018-06-29 23:20:17 UTC
So I'm really interested in how the file system got to that state.  If you have the history of how the file system was resized up until now, that would be really useful.

In any case, the file system really is corrupted, although the good news is that it should be a relatively simple thing to fix; you just need to upgrade to a non-prehistoric version of e2fsprogs.

It looks like you are using the RHEL 7 kernel and e2fsprogs.  As such, you should really be getting support from Red Hat --- and they may very well tell you that using a file system this big isn't something Red Hat supports.  Given that they are using super-ancient versions of the kernel and e2fsprogs (possibly with some bug fixes and features backported), that might be quite fair.  But because they do backport code, it's not something that upstream developers can really support.  This is why Red Hat customers pay the Big Bucks to Red Hat.  :-)

In any case, e2fsck from e2fsprogs 1.44.2 should be able to repair it.  Using it may void your Red Hat support contract, though --- in which case the right answer is to file a bug with Red Hat and ask them to fix it.

# (MKE2FS_FIRST_META_BG=1152 mke2fs -t ext4 -O meta_bg,^resize_inode -b 4k /tmp/foo.img 2297954304)
mke2fs 1.44.2 (14-May-2018)
Creating regular file /tmp/foo.img
Creating filesystem with 2297954304 4k blocks and 287244288 inodes
Filesystem UUID: 02abae05-96a0-4cbe-85fe-3c2d0c97cf4e
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done       

# mount /tmp/foo.img /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop1, missing codepage or helper program, or other error.

# dmesg | tail -2
[110950.298537] EXT4-fs (loop1): mounted filesystem with ordered data mode. Opts: (null)
[111476.144952] EXT4-fs (loop1): first meta block group too large: 1152 (group descriptor block count 1096)

# e2fsck -f /tmp/foo.img
e2fsck 1.44.2 (14-May-2018)
First_meta_bg is too big.  (1152, max value 1096).  Clear<y>? yes
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(1097--1152)
Fix<y>? yes
Free blocks count wrong for group #0 (27482, counted=27538).
Fix<y>? yes
Free blocks count wrong for group #1 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #3 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #5 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #7 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #9 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #25 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #27 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #49 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #81 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #125 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #243 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #343 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #625 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #729 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #2187 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #2401 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #3125 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #6561 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #15625 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #16807 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #19683 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong for group #59049 (31615, counted=31671).
Fix<y>? yes
Free blocks count wrong (2279572611, counted=2279573899).
Fix<y>? yes

/tmp/foo.img: ***** FILE SYSTEM WAS MODIFIED *****
/tmp/foo.img: 11/287244288 files (0.0% non-contiguous), 18380405/2297954304 blocks

# mount /tmp/foo.img /mnt

# df /mnt
Filesystem      1K-blocks  Used  Available Use% Mounted on
/dev/loop1     9118295620    24 8658688352   1% /mnt
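
The group numbers in the e2fsck output above follow a pattern: they look like the groups that carry backup superblocks and descriptors under sparse_super (group 0, group 1, and powers of 3, 5 and 7), and each one is corrected by 31671 - 31615 = 56 blocks, which equals 1152 - 1096. The sketch below enumerates those groups and checks the totals. The group count of 70128 assumes 32768 blocks per group, and the function name is hypothetical.

```shell
# List groups that hold backups under the sparse_super layout:
# 0, 1, and powers of 3, 5 and 7 below the total group count.
sparse_groups() {
    local total=$1 base g
    {
        echo 0
        echo 1
        for base in 3 5 7; do
            g=$base
            while [ "$g" -lt "$total" ]; do
                echo "$g"
                g=$(( g * base ))
            done
        done
    } | sort -n
}

total_groups=70128                 # assumed: 2297954304 blocks / 32768 per group
per_group=$(( 31671 - 31615 ))     # correction e2fsck applied in each listed group
count=$(sparse_groups "$total_groups" | wc -l | tr -d ' ')

echo "groups with backups: $count"
echo "blocks freed per group: $per_group"
echo "total blocks freed: $(( count * per_group ))"
```

With these inputs the sketch gives 23 groups and 1288 blocks in total, which agrees with the difference between the two free block counts in the final e2fsck message (2279573899 - 2279572611).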
Comment 2 Theodore Tso 2018-06-29 23:24:56 UTC
By the way, e2fsprogs 1.42.9 has a huge number of resize2fs bugs when doing off-line resizes.  I don't know how many of them Red Hat may or may not have backported into that ancient version of e2fsprogs, but if you insist on using Red Hat's e2fsprogs, I'd strongly recommend that you stick with on-line resizes (that is, with the file system mounted). 

The 3.10 kernel also has an untold number of bugs that have since been fixed upstream; again, I can't speak to how many of them have been backported to Red Hat's RHEL kernel.

So if you're going to use RHEL 7 software, I strongly suggest you get Red Hat support.
Comment 3 mcolgin 2018-07-03 18:33:23 UTC
@theodore 

RE:
So I'm really interested in how the file system got to that state.  If you have the history of how the file system was resized up until now, that would be really useful.


I went through my "~/.bash_history" to pull out the commands that led to the error. The commands listed under "HISTORY" were performed throughout 2017, and the LV was grown incrementally over time with relative sizing.

HISTORY
=======
lvextend --size +200G /dev/vg_areca/lv_mylvname
resize2fs -f /dev/vg_areca/lv_mylvname
lvextend --size +100G /dev/vg_areca/lv_mylvname
resize2fs /dev/vg_areca/lv_mylvname
lvextend --size +100G /dev/vg_areca/lv_mylvname
resize2fs -f /dev/vg_areca/lv_mylvname
lvextend --size +500G /dev/vg_areca/lv_mylvname
lvextend --size +500G /dev/vg_areca/lv_mylvname
resize2fs -f /dev/vg_areca/lv_mylvname


These "HISTORY" commands are many months old. The commands that led to the error are listed below; they locked this LV down to a specific size, as no other files were going to be added to it. NOTE: the "tune2fs" call reduced the reserved blocks, and I used "blockdev --getbsz" to work out the exact number of blocks needed for the subsequent "lvreduce --size 8766G".

OPS THAT LED TO ERROR
=====================
e2fsck -f /dev/vg_areca/lv_mylvname
tune2fs -m 0.0 /dev/vg_areca/lv_mylvname
blockdev --getbsz /dev/vg_areca/lv_mylvname
resize2fs /dev/vg_areca/lv_mylvname 8766G
lvreduce --size 8766G /dev/vg_areca/lv_mylvname 
mount /dev/vg_areca/lv_mylvname
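
As a cross-check on the sizes in the commands above, the following sketch converts the 8766G target into 4 KiB blocks and shows what filesystem size 1152 descriptor blocks would correspond to. The geometry (4 KiB blocks, 32768 blocks per group, 64 descriptors per block) is assumed rather than taken from the report, and the helper name is invented.

```shell
BLOCK_SIZE=4096            # assumed filesystem block size in bytes
BLOCKS_PER_GROUP=32768     # assumed
DESCS_PER_BLOCK=64         # assumed: 4096-byte block / 64-byte descriptor

# Convert a size in GiB to a count of filesystem blocks.
gib_to_blocks() {
    echo $(( $1 * 1024 * 1024 * 1024 / BLOCK_SIZE ))
}

target_blocks=$(gib_to_blocks 8766)
echo "8766G in 4k blocks: $target_blocks"

# Blocks covered by the groups that 1152 full descriptor blocks can describe.
blocks_at_1152=$(( 1152 * DESCS_PER_BLOCK * BLOCKS_PER_GROUP ))
echo "blocks covered by 1152 descriptor blocks: $blocks_at_1152"
echo "that size in GiB: $(( blocks_at_1152 * BLOCK_SIZE / 1024 / 1024 / 1024 ))"
```

The first figure, 2297954304, matches the block count e2fsck reports. The second works out to 9216 GiB, larger than the 8766G target; whether the filesystem was ever that large before the shrink is not stated here. Note also that "blockdev --getbsz" prints a block size in bytes, not the device size; "blockdev --getsize64" is the option that prints the device size in bytes.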