Repeated or even reproducible occurrences of EXT4-fs error: ext4_mb_generate_buddy
Last known good version: 3.1.5
More info on the above URL; possible trigger in the last message.
Currently using 3.1.7 without issues.
I've potentially seen this with 3.2.1/3.2.5 (Fedora 16):
    messages-20120219:Feb 16 18:11:40 laptop kernel: [688848.718886] EXT4-fs error (device dm-1): mb_free_blocks:1348: group 28, block 924740:freeing already freed block (bit 7236)
    messages-20120219:Feb 16 18:11:40 laptop kernel: [688848.718890] EXT4-fs error (device dm-1): mb_free_blocks:1348: group 28, block 924741:freeing already freed block (bit 7237)
    ...
    messages:Feb 19 04:46:02 laptop kernel: [106683.790999] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:739: group 311, 15861 clusters in bitmap, 9309 in gd
    messages:Feb 19 04:46:02 laptop kernel: [106683.874849] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:739: group 321, 30343 clusters in bitmap, 30295 in gd
but... the laptop in question is quite dodgy hardware-wise (wiggle the case a bit and the Thinklight blinks, etc.). Typical light laptop use with root/home on LVM the way Fedora puts them by default. So far it's getting ext4 errors every few days; reboot & fsck and it's fine for a while. Running 3.2.6 now.
I have a filesystem that exhibits the problem in the original mailing-list post:
    EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
    <copy data into the filesystem>
    EXT4-fs error (device dm-3): ext4_mb_generate_buddy:738: group 186, 32764 blocks in bitmap, 32766 in gd
    Aborting journal on device dm-3-8.
    EXT4-fs (dm-3): Remounting filesystem read-only
    EXT4-fs (dm-3): ext4_da_writepages: jbd2_start: 9223372036854775807 pages, ino 452; err -30
fsck'ing the filesystem will repair it. Performing the exact same copy again will reproduce the exact error. I can reproduce this 100% of the time and have bisected it back to
    commit d5b8f31007a93777cfb0603b665858fb7aebebfc
    Author: Theodore Ts'o <tytso@mit.edu>
    Date: Fri Sep 9 18:44:51 2011 -0400
    ext4: bigalloc changes to block bitmap initialization functions
    Add bigalloc support to ext4_init_block_bitmap() and ext4_free_blocks_after_init().
It's still present in Linus' tree as of a9e1e53bcfb29b3b503a5e75ce498d9a64f32c1e (10-Apr). I'm more than happy to help out with more information or testing patches since I've got it 100% reproducible.
I encounter the same problems with an Intel SSD and ext4 using the latest Ubuntu 12.04 kernel (3.2.0.24.26) and even with a 3.4.0-030400rc4 build. Going back to kernel release 3.0.0-19 from Ubuntu 11.10 makes the problems go away. More information: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/992424 If you need any more information, please let me know.
It also started on Ubuntu maverick when I updated to 2.6.35.32.42
Ted, Robert bisected at least one instance of this to a specific commit w.r.t. bigalloc; do you know if it's been resolved?
Unfortunately I am not currently in a position to easily test. I've got only one machine using ext4 and that has been running 3.1.7 since I set it up, reported this issue and found that to be the last working version.
I'm pretty sure the bug the user was seeing is the one which was introduced in 3.2, and was fixed by commit b0dd6b70f0fd, which landed in 3.5, and then backported to 3.2.20 and 3.4.3. Birger first saw it in 3.2, and fell back to 3.1.7 where he didn't see it. And the symptoms matched up. So that's almost certainly it. Note: the file system wasn't actually corrupted at all; ext4 was just confused. :-)
    commit b0dd6b70f0fda17ae9762fbb72d98e40a4f66556
    Author: Theodore Ts'o <tytso@mit.edu>
    Date: Thu Jun 7 18:56:06 2012 -0400
    ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bg
    Ext3 filesystems that are converted to use as many ext4 file system features as possible will enable uninit_bg to speed up e2fsck times. These file systems will have a native ext3 layout of inode tables and block allocation bitmaps (as opposed to ext4's flex_bg layout). Unfortunately, in these cases, when first allocating a block in an uninitialized block group, ext4 would incorrectly calculate the number of free blocks in that block group, and then errorneously report that the file system was corrupt:
        EXT4-fs error (device vdd): ext4_mb_generate_buddy:741: group 30, 32254 clusters in bitmap, 3
    This problem can be reproduced via:
        mke2fs -q -t ext4 -O ^flex_bg /dev/vdd 5g
        mount -t ext4 /dev/vdd /mnt
        fallocate -l 4600m /mnt/test
    The problem was caused by a bone headed mistake in the check to see if a particular metadata block was part of the block group. Many thanks to Kees Cook for finding and bisecting the buggy commit which introduced this bug (commit fd034a84e1, present since v3.2).
    Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
    Reported-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Tested-by: Kees Cook <keescook@chromium.org>
    Cc: stable@kernel.org
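The version ranges in the comment above can be turned into a quick check. This is only a rough sketch in Python: the function names `parse_version` and `fix_status` are made up for illustration, it handles plain upstream version strings such as "3.2.20" only, and it assumes 3.3.x never received the fix because no 3.3 backport is mentioned in this thread. Distribution kernels carry their own backports, so the result says nothing definite about them.

```python
def parse_version(text):
    """Turn "3.2.20" into (3, 2, 20); a missing patch level counts as 0."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def fix_status(version):
    """Classify an upstream kernel version with respect to commit b0dd6b70f0fd.

    Based only on what the thread states: bug introduced in 3.2,
    fix landed in 3.5 and was backported to 3.2.20 and 3.4.3.
    """
    v = parse_version(version)
    if v < (3, 2, 0):
        return "not affected"   # predates the buggy commit
    if v >= (3, 5, 0):
        return "fixed"
    if v[:2] == (3, 2):
        return "fixed" if v[2] >= 20 else "affected"
    if v[:2] == (3, 4):
        return "fixed" if v[2] >= 3 else "affected"
    return "affected"           # assumption: no backport to 3.3.x


for ver in ("3.1.7", "3.2.1", "3.2.20", "3.4.3", "3.6.3"):
    print(ver, fix_status(ver))
```

A "fixed" result only rules out this particular bug; as later comments show, the same error message can have other causes.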
A quick check of both b0dd6b70f0fd and its parent showed it fixed for me as well.
I also get this error every few days, using Fedora 17 (kernel 3.6.3-1.fc17.x86_64).
b0dd6b70f0fd is present in v3.5. Benjamin, does e2fsck -f find anything on the filesystem in question?
_Fedora-17-x86_6: recovering journal
Clearing orphaned inode 36955 (uid=1000, gid=1000, mode=0100600, size=192)
Clearing orphaned inode 36947 (uid=1000, gid=1000, mode=0100600, size=816)
Clearing orphaned inode 1716 (uid=1000, gid=1000, mode=0100600, size=2119)
Clearing orphaned inode 262914 (uid=1000, gid=1000, mode=0100644, size=25086)
Clearing orphaned inode 262828 (uid=1000, gid=1000, mode=0100644, size=25086)
Clearing orphaned inode 923 (uid=1000, gid=1000, mode=0100600, size=4104)
Clearing orphaned inode 262737 (uid=1000, gid=1000, mode=0100644, size=25086)
Clearing orphaned inode 262857 (uid=1000, gid=1000, mode=0100644, size=25086)
Clearing orphaned inode 36946 (uid=1000, gid=1000, mode=0100600, size=4104)
Clearing orphaned inode 1714 (uid=1000, gid=1000, mode=0100600, size=2056)
Clearing orphaned inode 263736 (uid=1000, gid=1000, mode=0100644, size=25086)
Clearing orphaned inode 393328 (uid=1000, gid=1000, mode=0100644, size=33344)
Clearing orphaned inode 36944 (uid=1000, gid=1000, mode=0100600, size=2052)
Clearing orphaned inode 36943 (uid=1000, gid=1000, mode=0100600, size=20500)
Clearing orphaned inode 2480 (uid=1000, gid=1000, mode=0100600, size=16400)
Clearing orphaned inode 1720 (uid=1000, gid=1000, mode=0100600, size=2048)
Clearing orphaned inode 1717 (uid=1000, gid=1000, mode=0100600, size=2056)
Clearing orphaned inode 1715 (uid=1000, gid=1000, mode=0100600, size=512)
Clearing orphaned inode 902 (uid=1000, gid=1000, mode=0100600, size=2048)
Clearing orphaned inode 798 (uid=1000, gid=1000, mode=0100600, size=2056)
Clearing orphaned inode 792 (uid=1000, gid=1000, mode=0100600, size=512)
Clearing orphaned inode 264090 (uid=1000, gid=1000, mode=0100644, size=25086)
Clearing orphaned inode 791 (uid=1000, gid=1000, mode=0100600, size=4096)
Clearing orphaned inode 263737 (uid=1000, gid=1000, mode=0100644, size=25086)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 791 was part of the orphaned inode list. FIXED.
Inode 1716 was part of the orphaned inode list. FIXED.
Inode 36947 was part of the orphaned inode list. FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: +(1262130--1262131) +1313070 -(1685212--1685213)
Fix<y>? yes
Free blocks count wrong for group #40 (11881, counted=11880).
Fix<y>? yes
Free blocks count wrong (11735574, counted=11735652).
Fix<y>? yes
Inode bitmap differences: -(791--792) -798 -902 -923 -(1714--1717) -1720 -2480 -(36943--36944) -(36946--36947) -36955 -393328 -396626 +397606
Fix<y>? yes
Free inodes count wrong for group #0 (0, counted=11).
Fix<y>? yes
Free inodes count wrong for group #4 (20, counted=24).
Fix<y>? yes
Free inodes count wrong for group #48 (714, counted=715).
Fix<y>? yes
Free inodes count wrong (3471460, counted=3471499).
Fix<y>? yes
_Fedora-17-x86_6: ***** FILE SYSTEM WAS MODIFIED *****
_Fedora-17-x86_6: ***** REBOOT LINUX *****
_Fedora-17-x86_6: 182133/3653632 files (0.9% non-contiguous), 2856348/14592000 blocks
(In reply to comment #12)
> _Fedora-17-x86_6: recovering journal
Hm, I wonder why the journal was dirty? Clearing orphaned inodes is fine.
> Pass 1: Checking inodes, blocks, and sizes
> Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
> Inode 791 was part of the orphaned inode list. FIXED.
> Inode 1716 was part of the orphaned inode list. FIXED.
> Inode 36947 was part of the orphaned inode list. FIXED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences: +(1262130--1262131) +1313070 -(1685212--1685213)
> Fix<y>? yes
> Free blocks count wrong for group #40 (11881, counted=11880).
> Fix<y>? yes
> Free blocks count wrong (11735574, counted=11735652).
> Fix<y>? yes
Those might have led to the errors you saw; let's see if that clears up the runtime errors.
-Eric
I have no idea why the journal was dirty... Let's see if it happens again. Very annoying. I've already lost some data because of this.
I haven't done extensive testing due to other issues, but at least 3.7.0 and 3.8.0 seem to be ok. However immediately thereafter (git linux-stable) I got random contents.
Hi, I know this thread is very old, but I am getting exactly this error on my Crucial M4 SSD. It happens every three days. This is a Gentoo box with the most recent packages (it happened with both 3.9 and 3.10rc kernels). I suspect faulty hardware, but I really don't know. Any hints on where to look?
I am seeing this on all these Debian kernel versions:
    ii linux-image-3.14-0.bpo.1-686-pae 3.14.7-1~bpo70+1 i386 Linux 3.14 for modern PCs
    ii linux-image-3.15-trunk-686-pae 3.15.3-1~exp1 i386 Linux 3.15 for modern PCs
    ii linux-image-3.2.0-4-686-pae 3.2.57-3+deb7u2 i386 Linux 3.2 for modern PCs
The Debian kernel reporting howto instructs me to report this upstream here when the issue is reproducible with the latest upstream version.
    [ 308.192011] EXT4-fs (md124): error count: 1
    [ 308.192063] EXT4-fs (md124): initial error at 1404174299: ext4_mb_generate_buddy:739
    [ 308.192143] EXT4-fs (md124): last error at 1404174299: ext4_mb_generate_buddy:739
Is this the same bug or should I file a new issue in bugzilla?
Created attachment 142111 (dmesg). Attached dmesg.
The EXT4-fs error reported by "ext4_mb_generate_buddy" simply means that there is an inconsistency between the block allocation bitmap and the number of free blocks reported in a particular block group descriptor. An e2fsck should fix this. There are some people who are reporting that it is showing up more frequently in 3.15, but the problem is there can be multiple causes of this inconsistency. It can be caused by a hardware problem; it can be caused by flash devices that don't handle power failures gracefully; etc, etc. If you can reliably reproduce it, and it doesn't reliably reproduce on an older kernel, then it becomes interesting.
As for the "error count" message, this simply means that an EXT4-fs error has been reported since the last time the file system was checked using e2fsck. It will be displayed once every 24 hours, so that system administrators hopefully get the hint and get their file system repaired before they lose catastrophic amounts of data. The format is confusing, and we will be trying to make the error messages more clear. In particular "initial error at 1404174299" will probably need to get changed to something like "first error since e2fsck happened at time 1404174299", which is the number of seconds since January 1, 1970 --- since the kernel doesn't have time zone information. You can convert that into a human readable time stamp as follows:
    % date -d @1404174299
    Mon Jun 30 20:24:59 EDT 2014
The "error count" message will get cleared by e2fsck, so long as you are using a version of e2fsck newer than 1.41.13, released in December 2010. People who upgrade their kernels without upgrading e2fsck can get into a situation where they continuously get the error count message even though they have gotten their file system checked. But if you are using that ancient version of e2fsprogs, you **really** want to upgrade to a newer version, just because of the huge number of bugs fixed in the last 3.5 years.
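For anyone who wants to do the same conversion over a whole dmesg capture, here is a minimal sketch in Python. The helper name `parse_ext4_error_time` and the regular expression are illustrative only and not part of any ext4 or e2fsprogs tool. It reports UTC, since the kernel timestamp carries no time zone.

```python
import re
from datetime import datetime, timezone

# Matches the "initial error at ..." and "last error at ..." lines
# quoted earlier in this thread.
_ERR_TIME = re.compile(r"(initial|last) error at (\d+): (\S+)")


def parse_ext4_error_time(line):
    """Return (which, utc_datetime, location) or None if the line does not match."""
    m = _ERR_TIME.search(line)
    if m is None:
        return None
    which, epoch, location = m.group(1), int(m.group(2)), m.group(3)
    # Seconds since January 1, 1970, interpreted as UTC.
    when = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return which, when, location


line = "[ 308.192063] EXT4-fs (md124): initial error at 1404174299: ext4_mb_generate_buddy:739"
which, when, location = parse_ext4_error_time(line)
print(which, when.isoformat(), location)
# initial 2014-07-01T00:24:59+00:00 ext4_mb_generate_buddy:739
```

00:24:59 UTC on July 1 is the same instant as the 20:24:59 EDT on June 30 shown by date above.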
Thank you so much for your kind clarification, Theo. After re-reading your reply and my dmesg I found out I was simply fsck-ing the wrong device :) I did fsck the reporting filesystem and the issue went away. Cheers :)
When trying to upgrade my Debian 7.5 box to 7.6 with "apt-get upgrade", I also hit this issue recently. I checked the kernel source and found that commit b0dd6b70f0fd had already been applied to this box's kernel.
    [ 35.500785] EXT4-fs (xvda1): re-mounted. Opts: errors=remount-ro,barrier=0
    [ 35.526502] loop: module loaded
    [ 36.367517] RPC: Registered named UNIX socket transport module.
    [ 36.367520] RPC: Registered udp transport module.
    [ 36.367521] RPC: Registered tcp transport module.
    [ 36.367523] RPC: Registered tcp NFSv4.1 backchannel transport module.
    [ 36.378239] FS-Cache: Loaded
    [ 36.393865] FS-Cache: Netfs 'nfs' registered for caching
    [ 36.405048] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
    [ 38.196190] EXT4-fs error (device xvda1): ext4_mb_generate_buddy:739: group 33, 9127 clusters in bitmap, 9143 in gd
    [ 38.198208] Aborting journal on device xvda1-8.
    [ 38.201249] EXT4-fs (xvda1): Remounting filesystem read-only
    [ 38.202165] EXT4-fs error (device xvda1) in ext4_reserve_inode_write:4507: Journal has aborted
    [ 38.203509] EXT4-fs error (device xvda1): ext4_journal_start_sb:327: Detected aborted journal
    [ 38.207454] EXT4-fs error (device xvda1) in ext4_reserve_inode_write:4507: Journal has aborted
    [ 38.210148] EXT4-fs error (device xvda1) in ext4_ext_remove_space:2779: Journal has aborted
    [ 38.212614] EXT4-fs error (device xvda1) in ext4_reserve_inode_write:4507: Journal has aborted
    [ 38.215156] EXT4-fs error (device xvda1) in ext4_ext_truncate:4420: Journal has aborted
    [ 38.217590] EXT4-fs error (device xvda1) in ext4_reserve_inode_write:4507: Journal has aborted
    [ 38.220327] EXT4-fs error (device xvda1) in ext4_orphan_del:2110: Journal has aborted
    [ 38.227881] EXT4-fs error (device xvda1) in ext4_reserve_inode_write:4507: Journal has aborted
    # uname -a
    Linux AY140708130459758e83Z 3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux
When I ran fsck, I got similar output:
    fsck from util-linux 2.20.1
    e2fsck 1.42.5 (29-Jul-2012)
    /dev/xvda1: recovering journal
    Pass 1: Checking inodes, blocks, and sizes
    Deleted inode 265617 has zero dtime. Fix<y>? yes
    Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
    Inode 917518 was part of the orphaned inode list. FIXED.
    Inode 917533 was part of the orphaned inode list. FIXED.
    Inode 917537 was part of the orphaned inode list. FIXED.
    Inode 917539 was part of the orphaned inode list. FIXED.
    Inode 917541 was part of the orphaned inode list. FIXED.
    Deleted inode 917545 has zero dtime. Fix<y>? yes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    Block bitmap differences: -(41472--41862) -(41920--41952) -(42464--42485) -(53184--53217) -(1092376--1092379) -(1092388--1092399)
    Fix<y>? yes
    Inode bitmap differences: -265617 -917518 -917533 -917537 -917539 -917541 -917545
    Fix<y>? yes
    Free inodes count wrong for group #32 (1429, counted=1430).
    Fix<y>? yes
    Free inodes count wrong (1273982, counted=1273983).
    Fix<y>? yes
    /dev/xvda1: ***** FILE SYSTEM WAS MODIFIED *****
    /dev/xvda1: ***** REBOOT LINUX *****
    /dev/xvda1: 36737/1310720 files (0.2% non-contiguous), 359493/5242368 blocks
Zhiyong,
The ext4_mb_generate_buddy() error message just means that there is an inconsistency between the allocation bitmap and the allocation statistics in the block group descriptor. So people who just google for "ext4_mb_generate_buddy" can often find old bugs that had similar symptoms. The most recent bugfix in this area was caused by a backport of commit:
    007649375f6af2 ext4: initialize multi-block allocator before checking block descriptors
which caused errors after a replay. It was reverted in mainline commit:
    f9ae9cf5d72b39 ext4: revert commit which was causing fs corruption after journal replays
You might want to check to see if that might be applicable for your kernel.
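To compare reports like the ones in this thread, the two counts in the message can be pulled out mechanically. The following is a sketch in Python, assuming the message layout seen in the logs quoted above ("blocks in bitmap" on older kernels, "clusters in bitmap" on newer ones). The names `BuddyMismatch` and `parse_buddy_error` are hypothetical and not part of any kernel or e2fsprogs tool.

```python
import re
from typing import NamedTuple, Optional


class BuddyMismatch(NamedTuple):
    device: str
    group: int
    in_bitmap: int   # free count derived from the allocation bitmap
    in_gd: int       # free count stored in the block group descriptor

    @property
    def delta(self) -> int:
        """Bitmap count minus descriptor count."""
        return self.in_bitmap - self.in_gd


_BUDDY = re.compile(
    r"EXT4-fs error \(device ([^)]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (\d+), (\d+) (?:clusters|blocks) in bitmap, (\d+) in gd"
)


def parse_buddy_error(line: str) -> Optional[BuddyMismatch]:
    """Return the parsed mismatch, or None if the line is not a complete buddy error."""
    m = _BUDDY.search(line)
    if m is None:
        return None
    return BuddyMismatch(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))


line = ("[ 38.196190] EXT4-fs error (device xvda1): ext4_mb_generate_buddy:739: "
        "group 33, 9127 clusters in bitmap, 9143 in gd")
hit = parse_buddy_error(line)
print(hit.device, hit.group, hit.delta)
# xvda1 33 -16
```

The delta only says how far the two counts disagree. It does not identify the cause, which, as noted above, can range from a kernel bug to failing hardware.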
Ted, thanks a lot for your advice; I will try it. If anything comes up, I will post it here.
Hi Ted, I checked the Debian 7.5/7.6 kernel source code and it has not merged commit 007649375f6af2, so the issue I hit does not seem to be caused by that commit.
(In reply to Theodore Tso from comment #23)
> Zhiyong, The ext4_mb_generate_buddy() error message just means that there
> is an inconsistency between the allocation bitmap and the allocation
> statistics in the block group descriptor. So people who just google for
> "ext4_mb_generate_buddy" can often find old bugs that had similar symptoms.
> The most recent bugfix in this area was caused by a backport of commit:
> 007649375f6af2 ext4: initialize multi-block allocator before checking block
> descriptors which caused errors after a replay. It was reverted in
> mainline commit: f9ae9cf5d72b39 ext4: revert commit which was causing fs
> corruption after journal replays You might want to check to see if that
> might be applicable for your kernel.