Bug 195561 - Suspicious persistent EXT4-fs error: ext4_validate_block_bitmap:395: [Proc] bg 17: block 557056: invalid block bitmap
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: x86-64 Linux
Importance: P1 high
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2017-04-24 02:40 UTC by Mauro Rossi
Modified: 2017-05-14 18:20 UTC
Kernel Version: 4.4 to 4.11
Regression: No

Attachments
dmesg on Phy SATA HDD1 (120.63 KB, text/plain)
2017-04-24 02:40 UTC, Mauro Rossi
dmesg on Phy SATA HDD2 (123.08 KB, text/plain)
2017-04-24 02:41 UTC, Mauro Rossi
dmesg on Virtual vdi n.3 (125.15 KB, text/plain)
2017-04-24 02:42 UTC, Mauro Rossi
dmesg on vbox kernel 4.11rc8 (108.48 KB, text/plain)
2017-04-24 02:44 UTC, Mauro Rossi
Lustre HACK that avoids the ext4_error() and signals an ext4_warning() and unproperly silences the errors (13.31 KB, application/mbox)
2017-04-24 03:00 UTC, Mauro Rossi
dumpe2fs of RO physical sdc1 (55.75 KB, text/plain)
2017-04-25 15:22 UTC, Mauro Rossi
dumpe2fs -h of RO physical sdc1 (2.12 KB, text/plain)
2017-04-25 15:23 UTC, Mauro Rossi
Output of: dd if=/dev/sdc1 of=/tmp/block.dat bs=4k count=1 skip=557056 (4.00 KB, application/octet-stream)
2017-04-25 15:27 UTC, Mauro Rossi
Output of: od -Ax4 -tx4 -a /tmp/block.dat (2.51 KB, text/plain)
2017-04-25 15:29 UTC, Mauro Rossi
e2fsck -fn for RO physical sdc1 (7.01 KB, text/plain)
2017-04-25 15:31 UTC, Mauro Rossi
e2fsck -fy for RO physical sdc1 (3.98 KB, text/plain)
2017-04-25 15:32 UTC, Mauro Rossi
dumpe2fs of another RO physical sdc1 60Gb (146.03 KB, text/plain)
2017-04-25 23:13 UTC, Mauro Rossi
dumpe2fs -h of another RO physical sdc1 60Gb (2.12 KB, text/plain)
2017-04-25 23:14 UTC, Mauro Rossi
Output of: dd if=/dev/sdc1 of=/tmp/block_skip_xxxx56.dat bs=4k count=1 skip=557056 (4.00 KB, application/octet-stream)
2017-04-25 23:16 UTC, Mauro Rossi
Output of: od -Ax4 -tx4 -a /tmp/block_skip_xxxx56.dat (2.51 KB, text/plain)
2017-04-25 23:18 UTC, Mauro Rossi
e2fsck -fn for another RO physical sdc1 60Gb (7.55 KB, text/plain)
2017-04-25 23:19 UTC, Mauro Rossi
e2fsck -fy for another RO physical sdc1 60Gb (6.07 KB, text/plain)
2017-04-25 23:20 UTC, Mauro Rossi

Description Mauro Rossi 2017-04-24 02:40:05 UTC
Created attachment 255963
dmesg on Phy SATA HDD1

While testing Android 7.1 nougat-x86 (x86_64), several android-x86 community members noticed that the EXT4 partition gets remounted read-only,
which causes a boot loop with continuous kernel panics on Android 7.x
and requires reinstalling the Android OS image on the EXT4 partition.

In logcat we just see that everything stops working because the partition has been remounted read-only.

Looking at dmesg output we see the following attached three logs for three test cases:

Physical Sata HDD 1
Physical Sata HDD 2 
Virtualbox    vdi 3

January, 14th (ASUS motherboard with physical SATA HDD n.1)
[  842.760419] EXT4-fs error (device sda1): ext4_validate_block_bitmap:395: comm Binder:1454_E: bg 17: block 557056: invalid block bitmap
[  842.873601] Aborting journal on device sda1-8.
[  842.908371] EXT4-fs (sda1): Remounting filesystem read-only
[  842.923638] EXT4-fs error (device sda1) in ext4_do_update_inode:4679: Journal has aborted

March, 25th (ASUS motherboard with physical SATA HDD n.2, different from n.1)
[ 1510.269945] EXT4-fs error (device sda1): ext4_validate_block_bitmap:395: comm main: bg 17: block 557056: invalid block bitmap
[ 1510.285464] Aborting journal on device sda1-8.
[ 1510.301047] EXT4-fs (sda1): Remounting filesystem read-only
[ 1510.323400] EXT4-fs error (device sda1) in ext4_do_update_inode:4679: Journal has aborted

April, 25th (VirtualBox VM with vdi virtual drive n.3, different from n.1 and n.2)
[ 1510.269945] EXT4-fs error (device sda1): ext4_validate_block_bitmap:395: comm main: bg 17: block 557056: invalid block bitmap
[ 1510.285464] Aborting journal on device sda1-8.
[ 1510.301047] EXT4-fs (sda1): Remounting filesystem read-only
[ 1510.323400] EXT4-fs error (device sda1) in ext4_do_update_inode:4679: Journal has aborted

What they all have in common is the bg and block which happen to be exactly the same, no matter how many attempts on different physical or virtual HDDs.
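
The recurring block number is easier to interpret once it is mapped to a block group. The following is a minimal sketch, assuming the ext4 defaults of 32768 blocks per group and a first data block of 0 (typical for 4 KiB blocks; the real values are in the dumpe2fs -h output). The function name is illustrative only.

```python
BLOCKS_PER_GROUP = 32768  # assumption: default for 4 KiB blocks (8 * block size)
FIRST_DATA_BLOCK = 0      # assumption: 0 when the block size is larger than 1 KiB


def locate_block(block, blocks_per_group=BLOCKS_PER_GROUP,
                 first_data_block=FIRST_DATA_BLOCK):
    """Return (group index, offset inside that group) for a filesystem block."""
    return divmod(block - first_data_block, blocks_per_group)


group, offset = locate_block(557056)
print(f"block 557056 -> group {group}, offset {offset}")
```

Under these assumptions block 557056 is the very first block of group 17, which matches the "bg 17" reported in every log.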

The problem is intermittent, but happens quite frequently during the initial Google Play updates, so it may become a show stopper for Android and a range of different OSes.

One catalyst for the issue is multithreading/process forking, which Android 7.x uses far more than 6.0. Android 6.0 has no issue with the same kernels. In my understanding there may be some sort of block/bg locking issue leading to concurrent write and validation of bitmaps.

Another possible contributing cause may be the 64-bit kernel build,
as on VirtualBox the issue is systematic with 64-bit builds and I've never seen it with 32-bit builds. This would be consistent with the statements in [1].

Doing some research I found reference of this problem in different websites [1], [2] and [3]

[1] https://community.nxp.com/thread/447695

[2] https://jira.hpdd.intel.com/browse/LU-1026
(at the end EXT4 patch is mentioned)

[3] https://github.com/tweag/lustre/blob/master/ldiskfs/kernel_patches/patches/rhel7/ext4-corrupted-inode-block-bitmaps-handling-patches.patch

The attached HACK workaround avoids the problem (tested on top of kernel 4.4.62),
but it's not a solution, as it uses ext4_warning() instead of ext4_error()
and tricks the callers by pretending there was no error.
We could even put a check on "bg == 16 && block == 557056",
but it would still be a hack to work around a bug in the EXT4 bitmap validation code.

Kernels 4.9, 4.10 and 4.11 are confirmed to be affected as well.

Mauro
Comment 1 Mauro Rossi 2017-04-24 02:41:10 UTC
Created attachment 255965
dmesg on Phy SATA HDD2
Comment 2 Mauro Rossi 2017-04-24 02:42:28 UTC
Created attachment 255967
dmesg on Virtual vdi n.3
Comment 3 Mauro Rossi 2017-04-24 02:44:05 UTC
Created attachment 255969
dmesg on vbox kernel 4.11rc8
Comment 4 Mauro Rossi 2017-04-24 02:52:51 UTC
NOTE: April, 25th (VirtualBox VM with vdi virtual drive n.3, different from n.1 and n.2)

"April, 25th" is a typo; that dmesg log was collected on April, 15th.

The dmesg log on vbox with kernel 4.11rc8 was collected today, and the kernel was built without ubifs commit [4].

[4] https://github.com/torvalds/linux/commit/9ea33c44fb19e6cbacf65021628f8ce393ba9f34
Comment 5 Mauro Rossi 2017-04-24 03:00:42 UTC
Created attachment 255971
Lustre HACK that avoids the ext4_error() and signals an ext4_warning() and unproperly silences the errors

NOTE: the Lustre hack was backported to kernel 4.4 for testing purposes
and is not recommended as a correction.
Comment 6 Theodore Tso 2017-04-24 04:51:50 UTC
This message generally means the file system has been corrupted.

[ 1510.269945] EXT4-fs error (device sda1): ext4_validate_block_bitmap:395: comm main: bg 17: block 557056: invalid block bitmap

And that would explain why all of the kernel versions are complaining.   It would be useful to see the output of dumpe2fs or running e2fsck -fy on the file system.

Don't be too quick to assume it's a kernel bug before checking to see if the file system is just corrupted, and the kernel really is correct in complaining!
Comment 7 Andreas Dilger 2017-04-24 16:50:08 UTC
Given that the same corruption is happening across different block devices, it points to something other than the block device going bad.  It might be a bug in the ext4 code, or some other kernel code that is corrupting the memory (unlikely), or a userspace process that is clobbering this block.

It would be worthwhile to save a copy of the corrupted bitmap block for further analysis.  It may be possible to identify what is overwriting that block by looking at the content.

Collect "dumpe2fs -h" output for the filesystem, so we can see what features are enabled.  Collect "dd if=/dev/sda1 of=/tmp/block.dat bs=4k count=1 skip=557056" and then dump it via "od -Ax4 -tx4 -a /tmp/block.dat".

You could also potentially add a tracepoint or run blktrace to see which processes are writing to this block.  It should only be the jbd2 thread.
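
The dd command above can be sketched in Python, together with the offset arithmetic that is useful when looking for the same block in a block trace. This is only an illustration: the function names are made up here, the block size of 4096 mirrors bs=4k, and the sector number is relative to the start of the partition (a tool that reports whole-disk sectors would need the partition's start sector added). The demo reads from a throwaway image file, not from a real device.

```python
import os
import tempfile

BLOCK_SIZE = 4096    # matches bs=4k in the dd command
SECTOR_SIZE = 512    # assumption: traces count in 512-byte sectors


def block_to_offsets(block, block_size=BLOCK_SIZE):
    """Return (byte offset, 512-byte sector) of a block, relative to the partition start."""
    byte_offset = block * block_size
    return byte_offset, byte_offset // SECTOR_SIZE


def read_block(path, block, block_size=BLOCK_SIZE):
    """Read one filesystem block from a device or image file (like dd skip=block count=1)."""
    with open(path, "rb") as dev:
        dev.seek(block * block_size)
        return dev.read(block_size)


def count_set_bits(data):
    """Number of bits set in the data; for a block bitmap, the blocks marked in use."""
    return sum(bin(byte).count("1") for byte in data)


# Demo on a small temporary image: block 3 starts with two 0xff bytes.
with tempfile.NamedTemporaryFile(delete=False) as img:
    img.seek(3 * BLOCK_SIZE)
    img.write(b"\xff" * 2 + b"\x00" * (BLOCK_SIZE - 2))
    demo_path = img.name
demo_block = read_block(demo_path, 3)
os.remove(demo_path)

print(count_set_bits(demo_block))
print(block_to_offsets(557056))
```

For block 557056 this gives a byte offset of 2281701376 and a partition-relative sector of 4456448.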
Comment 8 Andreas Dilger 2017-04-24 16:53:34 UTC
Also, I don't think the "Lustre HACK that avoids the ext4_error() and signals an ext4_warning()" patch is all that bad, and I wouldn't be against landing it.  I think bitmap corruption can happen for a large number of reasons, and localizing the problems to a single block group can avoid taking the whole server out.
Comment 9 Mauro Rossi 2017-04-25 15:22:58 UTC
Created attachment 256009
dumpe2fs of RO physical sdc1
Comment 10 Mauro Rossi 2017-04-25 15:23:44 UTC
Created attachment 256011
dumpe2fs -h of RO physical sdc1
Comment 11 Mauro Rossi 2017-04-25 15:27:13 UTC
Created attachment 256013
Output of: dd if=/dev/sdc1 of=/tmp/block.dat bs=4k count=1 skip=557056
Comment 12 Mauro Rossi 2017-04-25 15:29:36 UTC
Created attachment 256015
Output of: od -Ax4 -tx4 -a /tmp/block.dat
Comment 13 Mauro Rossi 2017-04-25 15:31:36 UTC
Created attachment 256017
e2fsck -fn for RO physical sdc1
Comment 14 Mauro Rossi 2017-04-25 15:32:04 UTC
Created attachment 256019
e2fsck -fy for RO physical sdc1
Comment 15 Mauro Rossi 2017-04-25 23:13:33 UTC
Created attachment 256031
dumpe2fs of another RO physical sdc1 60Gb
Comment 16 Mauro Rossi 2017-04-25 23:14:21 UTC
Created attachment 256033
dumpe2fs -h of another RO physical sdc1 60Gb
Comment 17 Mauro Rossi 2017-04-25 23:16:14 UTC
Created attachment 256035
Output of: dd if=/dev/sdc1 of=/tmp/block_skip_xxxx56.dat bs=4k count=1 skip=557056
Comment 18 Mauro Rossi 2017-04-25 23:18:23 UTC
Created attachment 256037
Output of: od -Ax4 -tx4 -a /tmp/block_skip_xxxx56.dat
Comment 19 Mauro Rossi 2017-04-25 23:19:20 UTC
Created attachment 256039
e2fsck -fn for another RO physical sdc1 60Gb
Comment 20 Mauro Rossi 2017-04-25 23:20:05 UTC
Created attachment 256041
e2fsck -fy for another RO physical sdc1 60Gb
Comment 21 Theodore Tso 2017-04-26 15:11:05 UTC
So the fsck outputs demonstrate that the file system really *is* getting corrupted.  It's not an erroneous message.   So switching between kernels after the file system has been corrupted does not mean that the newer kernels have whatever bug might have caused the corruption.   The question is which kernel version *corrupted* the file system in the first place.

Since you are using an x86 kernel, my suggestion is before you try debugging it in an Android context, that you take that kernel and run a full set of regression tests on it.   See http://thunk.org/gce-xfstests for a very handy way to run the regression tests.  If you don't want to pay the cost for running tests in the cloud (a few pennies for each 30 minute smoke test, and around USD $1.50 for the full regression test), you can also use kvm-xfstests.  That will take longer, and it ties up a machine while the test is running (whereas you can fire off many tests in parallel using gce-xfstests, and just wait for the test reports to be e-mailed back to you).

Even when I was trying to debug ARM kernels, I would often convert/bludgeon the BSP kernel so that the non-portable hacks added by the vendors could be worked around so the kernel could be compiled for x86, just because running the regression tests was worth it.   These days, on an ARM android system, we do have something (probably alpha or very early beta quality) that will allow you to run the tests in a chroot.   This is primarily helpful if you are trying to debug something like hardware In-line crypto, that is only available from a particular ARM SOC.    For more information, please see:

https://github.com/tytso/xfstests-bld/blob/master/Documentation/android-xfstests.md

One warning.... many mobile handsets have ah.... "cost-optimized flash", which may be subject to early write exhaustion and massive write amplifications when stressed.  So if you try to run xfstests on your mobile handset, do it on a throwaway development machine where the flash is considered sacrificial.
Comment 22 Mauro Rossi 2017-04-29 21:59:14 UTC
> Another possible concurring root cause may be 64 bit kernel build,
> as on virtualbox the issue is systematic with 64 bit build and I've never saw 
>  it with 32bit builds.

Quoting myself, because I have now seen the issue on 32-bit Android with a 32-bit kernel as well.

(In reply to Theodore Tso from comment #21)
> So the fsck outputs demonstrate that the file system really *is* getting
> corrupted.  It's not an erroneous message.   So switching between kernels
> after the file system has been corrupted does not mean that the newer
> kernels have whatever bug might have caused the corruption.   The question
> is which kernel version *corrupted* the file system in the first place.

When I stated that all kernel versions between 4.4 and 4.11 are affected,
I did not switch kernels after the corruption; each time I rebuilt with a different kernel, installed Android on a clean EXT4 partition, booted, and updated Google Play Store/apps.

The Android installations based on different kernel versions (rebuilt and reinstalled to different hard drives) show the same issue, and the Lustre patches are undoubtedly a mitigation/workaround, still working on 4.11.
Those patches were prepared for Red Hat Linux.

The newest kernels I'm using have minimal changes compared to torvalds/master,
and no changes were made to fs/ext4, 4.11rc7 based one is here:

https://github.com/maurossi/linux/tree/kernel-4.11rc7


> Since you are using an x86 kernel, my suggestion is before you try debugging
> it in an Android context, that you take that kernel and run a full set of
> regression tests on it.   See http://thunk.org/gce-xfstests for a very handy
> way to run the regression tests.  If you don't want to pay the cost for
> runnning tests in the cloud (a few pennies for each 30 minute smoke test,
> and around USD$ 1.50 for the full regression test), you can also use
> kvm-xfstests.  That will take longer, and it ties up a machine while the
> test is running (where as you can fire off many tests in parallel using
> gce-xfstests, and just wait for the test reports to be e-mailed back to you).
> 
> Even when I was trying to debug ARM kernels, I would often convert/bludgeon
> the BSP kernel so that the non-portable hacks added by the vendors could be
> worked around so the kernel could be compiled for x86, just because running
> the regression tests was worth it.   These days, on an ARM android system,
> we do have something (probably alpha or very early beta quality) that will
> allow you to run the tests in a chroot.   This is primarily helpful you are
> trying to debug something like hardware In-line crypto, that is only
> available from a particular ARM SOC.    For more information, please see:
> 
> https://github.com/tytso/xfstests-bld/blob/master/Documentation/android-
> xfstests.md
> 
> One warning.... many mobile handsets have ah.... "cost-optimized flash",
> which may be subject to early write exhaustion and massive write
> amplifications when stressed.  So if you try to run xfstests on your mobile
> handset, do it on a throwaway development machine where the flash is
> considered sacrificial.

Thanks. For android-x86 I test on laptops and desktops with magnetic HDDs.
I'll try blktrace, which is available in the Android sources, and android-xfstests.
I've also contacted the original author of the ext4 LU-1026 patch, to get additional clues.

Mauro
Comment 23 Theodore Tso 2017-04-30 02:59:10 UTC
What version of AOSP are you using, and how recently have you refreshed with latest AOSP master (if you are using AOSP master)?

The reason why I ask is because there have been a number of changes that have recently landed in AOSP over the last couple of months changing how ext4 file systems are created.   They used to use make_ext4fs, and they are now using mke2fs plus a collection of Android-specific utilities in e2fsprogs/contrib/android.    (Note there are a lot of patches in AOSP's e2fsprogs that haven't been integrated into mainline e2fsprogs yet.)

No one else is complaining about problems in the Linux kernel, and given that you are seeing this problem across a huge spread of kernels, I'm wondering if the problem is in the version of AOSP you are using and whether it is creating file systems which are sane.   I know that the version being used internally at Google is working, but it's possible (a) that what has been pushed out to AOSP doesn't have all of the bug fixes that they are using, or (b) you haven't done a repo sync, and the bug has since been fixed in AOSP master, or (c) they have been doing mostly ARM-based testing and the problem hasn't been noticed in android-x86 yet.   (I have no idea which branches various Android development efforts are using; it's not my main area of focus and I've been way too busy recently to pay enough attention here.   So there is a bit of guessing here.)

Something you might want to try is try installing the file system where you've been having trouble --- and then checking it with e2fsck before the kernel has a chance to mount it read/write.   If e2fsck is reporting problems before the kernel has mounted it, or mounted it read/write, then it's almost certainly a userspace bug in AOSP somewhere.
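
A check like this can be scripted so that the result is unambiguous. The sketch below is illustrative: the function names are made up, and the only facts relied on are the e2fsck -f and -n options and the exit status bits documented in the e2fsck(8) man page. The device passed to check_unmounted() must not be mounted.

```python
import subprocess

# Exit status bits documented in the e2fsck(8) man page.
E2FSCK_FLAGS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status):
    """Translate the e2fsck exit status (a bit mask) into readable messages."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in E2FSCK_FLAGS.items() if status & bit]


def check_unmounted(device):
    """Run a forced, read-only check (e2fsck -fn) on a device that is NOT mounted."""
    result = subprocess.run(["e2fsck", "-fn", device],
                            capture_output=True, text=True)
    return decode_e2fsck_status(result.returncode), result.stdout


print(decode_e2fsck_status(4))
```

With -n nothing is repaired, so a freshly installed file system that already has problems would be expected to report "file system errors left uncorrected" here.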

One other thought.   I'm pretty sure there are some device kernels using 4.4 under development, but I'm *certain* there are device kernels using 3.18 (including, as it happens, my Pixel XL running Android 7.1.2).   So if you haven't tried 3.18, give it a quick spin.   If you are seeing the problem on 3.18, then it's almost certainly not a kernel bug, but Something Else.  The thing I come back to is that I'm not seeing any other complaints similar to yours from Desktop distributions using Linux, or from people with shipped devices, or people internally at Google doing development on Android O.   (Which is out as a developer preview.  :-)
Comment 24 Theodore Tso 2017-04-30 03:20:42 UTC
Hmm, the original bug report said you are using Android 7.1.  Which I think was still using make_ext4fs.   I really don't trust make_ext4fs farther than I can throw it.   Something else to try --- use "dd if=/dev/zero of=/dev/sdXX" before doing whatever you do to install the file system.   There are multiple ways of calling into the make_ext4fs codepaths (from fastboot, from the recovery partition, as part of the build process), and some of them didn't work correctly if the underlying storage device wasn't zero'ed out (as would be the case if you were using flash and the DISCARD command, or the eMMC equivalent).   I wonder if the issue has to do with a slightly different way the file system is getting set up on android-x86, and if you are using magnetic HDDs that don't have discard, that might be why you're seeing it and other Android developers aren't.

I got tired of getting pulled in to debug weird problems that were traced down to make_ext4fs being fragile (it works fine writing to sparse files, since uninitialized blocks in a sparse file are all zeroes), which is why I was very happy to see Android switching over to use mke2fs, instead of make_ext4fs, which was a clean-room implementation back in the day when some Android management was overly paranoid about GPLv2 in userspace.  There were some rough spots since mke2fs -d doesn't deal with SELinux and other Android specific issues, which is why e2fsprogs/contrib/android exists in the AOSP version of e2fsprogs.   (It's there in the upstream version of e2fsprogs, but there are a bunch of bug fixes in the AOSP version that haven't propagated back to the upstream e2fsprogs repo yet.)
Comment 25 James tao 2017-05-03 08:17:17 UTC
I am working on a Broadcom ARM chipset with Android 7.0, and I hit the same problem. The kernel version is 4.1.20.
I set up a TF card as "internal share storage". The problem occurs when I copy a file (the file size must be more than 2.3G) from the data partition to the sdcard (the filesystem on the TF card is ext4):
 cp /data/xxx.data /sdcard/ 

The kernel prints this log:
[  982.761411] EXT4-fs error (device dm-1): ext4_validate_block_bitmap:380: comm kworker/u4:2: bg 17: block 557056: invalid block bitmap
[  982.815546] EXT4-fs (dm-1): Remounting filesystem read-only

The TF card is then mounted read-only.
Comment 26 Theodore Tso 2017-05-03 17:48:32 UTC
Can you use e2fsck -fy /dev/devXX on the sdcard *before* you copy the file?

How was the sdcard formatted?

Can you give me a clean reproduction that doesn't involve using AOSP userspace?

For example, build an x86 kernel and then boot it using kvm-xfstests[1].

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

Then try to reproduce it there.  For example:

% kvm-xfstests shell
...
root@kvm-xfstests:~# mke2fs -F -t ext4 /dev/vdc
mke2fs 1.43.5-WIP-ed1e950f (26-Apr-2017)
/dev/vdc contains a ext4 file system
	last mounted on /vdc on Wed May  3 13:44:28 2017
Creating filesystem with 1310720 4k blocks and 327680 inodes
Filesystem UUID: 15426c7f-5695-4cd8-9b2e-78288883b877
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

root@kvm-xfstests:~# mount /dev/vdc /vdc
[   84.418730] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: (null)
root@kvm-xfstests:~# dd if=/dev/zero of=/vdc/test.img bs=1G count=3
3+0 records in
3+0 records out
3221225472 bytes (3.2 GB) copied, 18.6019 s, 173 MB/s
[  107.766622] dd (2616) used greatest stack depth: 5924 bytes left
root@kvm-xfstests:~# umount /vdc
root@kvm-xfstests:~# e2fsck -fy /dev/vdc
e2fsck 1.43.5-WIP-ed1e950f (26-Apr-2017)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/vdc: 12/327680 files (0.0% non-contiguous), 828511/1310720 blocks
root@kvm-xfstests:~# 

(You can exit the VM by typing Control-A x -- control-A followed by the 'x' character.)
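
As a side note on reading the mke2fs output above: the list of superblock backup locations follows from the sparse_super layout, where backups live in group 1 and in groups that are powers of 3, 5 and 7. The sketch below reproduces the list, assuming sparse_super is enabled, 32768 blocks per group (consistent with the first backup at block 32768) and a first data block of 0. The helper names are illustrative.

```python
def is_power_of(n, base):
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_groups(total_blocks, blocks_per_group=32768):
    """Groups (other than 0) that hold a superblock backup under sparse_super."""
    group_count = -(-total_blocks // blocks_per_group)  # ceiling division
    return [g for g in range(1, group_count)
            if any(is_power_of(g, base) for base in (3, 5, 7))]


groups = backup_groups(1310720)
print(groups)
print([g * 32768 for g in groups])
```

For the 1310720-block file system created above this yields groups 1, 3, 5, 7, 9, 25 and 27, whose first blocks are exactly the seven numbers mke2fs printed.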
Comment 27 James tao 2017-05-09 03:20:50 UTC
How was the sdcard formatted?
> The sdcard was formatted by make_ext4fs.

Can you give me a clean reproduction that doesn't involve using AOSP userspace?
> I tried the approach you suggested, but I can't access the git
> server (git.kernel.org).

[taohua@ubuntu-bsp ext4_issue]git clone git://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git fstests
Cloning into 'fstests'...
fatal: unable to connect to git.kernel.org:
git.kernel.org[0: 147.75.110.187]: errno=Connection timed out
git.kernel.org[1: 2604:1380:3000:3500::3]: errno=Network is unreachable

[taohua@ubuntu-bsp ext4_issue]ping git.kernel.org
PING pub.nrt.kernel.org (147.75.110.187) 56(84) bytes of data.
64 bytes from 147.75.110.187: icmp_req=1 ttl=53 time=57.2 ms
64 bytes from 147.75.110.187: icmp_req=2 ttl=53 time=57.1 ms
64 bytes from 147.75.110.187: icmp_req=3 ttl=53 time=57.1 ms

Alternatively, I tried running plain Linux (not Android) on the Broadcom ARM chipset, using the same kernel version. The test steps were as follows:
   1. mkfs.ext4 /dev/mmcblk0                     #format tf card as ext4
   2. mount -t ext4 /dev/mmcblk0 /mnt/hd/        #mount tf card
   3. mount -t ext4 /dev/mmcblk1p14 /mnt/flash/  # mount emmc partition as ext4
   4. cp /mnt/flash/*.data /mnt/hd/              #copy data to tf card
  
  # df
  Filesystem           1K-blocks      Used Available Use% Mounted on
  rootfs                  106944     10480     96464  10% /
  none                    106944         0    106944   0% /dev
  tmpfs                   118076         8    118068   0% /tmp
  shm                     118076         0    118076   0% /dev/shm
  /dev/mmcblk0           7381352   3124064   3859288  45% /mnt/hd
  /dev/mmcblk1p14        4536544   3116564   1166492  73% /mnt/flash

With this test on the same hardware everything looks good. So is it not an ext4 issue, but an issue with make_ext4fs?
Comment 28 Andreas Dilger 2017-05-09 05:01:03 UTC
As Ted has mentioned a couple of times, you could try running make_ext4fs and then immediately running "e2fsck -f /dev/XXXX" on the filesystem created by make_ext4fs _before_ it is mounted by the kernel.  It is entirely possible that make_ext4fs is creating a corrupted filesystem, and the kernel is detecting this at runtime.  This would explain why the filesystem is corrupted in the same way for all of the different devices.
Comment 29 Theodore Tso 2017-05-09 16:10:16 UTC
If your firewall is not letting you access git.kernel.org, please complain to your management.

There is a mirror of the xfstests-bld git tree on github: https://github.com/tytso/xfstests-bld

However, to access the prebuilt test appliance files, or some of the other git repositories needed by xfstests-bld if you want to do a build from scratch of the test appliance, you will need access to either www.kernel.org or git.kernel.org, so you might as well deal with your firewall configuration issues sooner rather than later.

I basically don't trust make_ext4fs at all.   There is a reason I have been pushing the android team to use e2fsprogs.  We do have an updated e2fsprogs in AOSP, and some of the tools so that make_ext4fs can be replaced by mke2fs plus some helper programs are mostly in AOSP.  They are not used by default, but rather the Android team has been cutting devices over one at a time for Android O, instead of wholesale.  My personal suspicion is that they are very paranoid because they are used to make_ext4fs breaking at random, so they have been doing a slow migration over because they fear similar breakages as they migrate away from make_ext4fs, particularly as it relates to building a new system.img file and doing a factory reset from the device.   However, if all you need to do is format an ext4 file system on an SD card, and not try to set up a system image or something else complicated where you have to set up SELinux labels, etc., using mke2fs should be just fine.   It's what the rest of the Linux ecosystem uses, and what we upstream developers use for all of our regression testing and daily personal use.   

To be clear, we do ****not**** use make_ext4fs; that's an Android "special" that was originally developed when certain management/legal types had a pathological fear of the GPLv2, and then it grew all out of control and manageability.   Since it was done as a clean room reimplementation, by someone who was not an ext4 expert, and then had extra ext4 features as well as SELinux spliced into it over time, it has caused me no end of headaches over the years....   :-(

If some Linaro folks could help accelerate the move away from make_ext4fs, I would be most grateful.
Comment 30 Mauro Rossi 2017-05-13 12:00:16 UTC
Hi,
I've checked a new installation of nougat-x86 from scratch, without booting it, so the filesystem is clean as of "installation just finished",
and the free block counts do seem to be tainted by make_ext4fs.

[1] seems to be Android's make_ext4fs patch for the problem.
[2] is a commit in Android master where they move away from make_ext4fs.


utente@utente-System-Product-Name:~$ e2fsck -fn /dev/sdb1
e2fsck 1.43.4 (31-Jan-2017)

Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  +(557056--557060) +(557066--557068) +557070 +557072 +(557074--557075) +(557077--557090) +557092 +(557094--557095) +557097 +(557099--557102) +557104 +557106 +(557108--557109) +(557111--557248) +(557250--557280) +(557282--557326) +(557328--557358) +(557360--557379) +557381 +(557389--557488) +557490 +(557492--557503) +(557506--557507) +557509 +557511 +557516 +(557521--557536) +(557538--557566) -557664 -(557728--557729) -557731 -557768 -557776 -557780 -557782 -(557794--557796) -557825 -557830 -(557856--557857) -557860 -(557888--557889) -(557897--557901) -(557904--557907) -(557911--557914) -(557918--557919) -(557922--557923) -557926 -(557929--557935) -557938 -557940 -557943 -557946 -557948 -557950 -(557954--557957) -(557959--557960) -(557963--557964) -557966 -(557968--557972) -(557974--557977) -(557980--557982) -557984 -557987 -557989 -557993 -557995 -(557998--557999) -(558001--558002) -(558007--558008) -(558010--558011) -558013 -558016 -558022 -(558025--558027) -(558029--558030) -558034 -(558037--558038) -558041 -(558044--558046) -(558048--558051) -(558053--558054) -558056 -558059 -(558061--558062) -558066 -(558069--558070) -558072 -(558074--558075) -558077 -(558083--558086) -(558091--558093) -(558097--558098) -(558100--558101) -558714 -558851 -559073 -559080 -559093 -(559842--559844) -(559858--559860) -559873 +(950272--950276) +(950282--950284) +950286 +950288 +(950290--950291) +(950293--950306) +950308 +(950310--950311) +950313 +(950315--950318) +950320 +950322 +(950324--950325) +(950327--950464) +(950466--950496) +(950498--950542) +(950544--950574) +(950576--950595) +950597 +(950605--950704) +950706 +(950708--950719) +(950722--950723) +950725 +950727 +950732 +(950737--950752) +(950754--950782) -950880 -(950944--950945) -950947 -950984 -950992 -(950994--950998) -(951010--951012) -951041 -951046 -(951072--951073) -951076 -(951104--951105) -(951113--951117) -(951120--951123) -(951127--951130) -(951134--951135) -(951138--951139) -951142 
-(951145--951151) -951154 -951156 -951159 -951162 -951164 -951166 -(951170--951173) -(951175--951176) -(951179--951180) -951182 -(951184--951188) -(951190--951193) -(951196--951198) -951200 -951203 -951205 -951209 -951211 -(951214--951215) -(951217--951218) -(951223--951224) -(951226--951227) -951229 -951232 -951238 -(951241--951243) -(951245--951246) -951250 -(951253--951254) -951257 -(951260--951262) -(951264--951267) -(951269--951270) -951272 -951275 -(951277--951278) -951282 -(951285--951286) -951288 -(951290--951291) -951293 -(951299--951302) -(951307--951309) -(951313--951314) -(951316--951317) -951930 -952067 -952289 -952296 -952309 -(953058--953060) -(953074--953076) -953089
Fix? no

Free blocks count wrong for group #17 (32257, counted=32586).
Fix? no

Free blocks count wrong for group #29 (32257, counted=32583).
Fix? no

Free blocks count wrong (4763034, counted=4376260).
Fix? no

Free inodes count wrong (1221589, counted=1215139).
Fix? no


Android-x86: ********** WARNING: Filesystem still has errors **********

Android-x86: 11/1221600 files (254.5% non-contiguous), 120718/4883752 blocks
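
The long "Block bitmap differences" list is easier to digest when summarized per block group. The sketch below parses tokens in the format shown above ("+N", "-N", "+(A--B)", "-(A--B)") and counts blocks per group and sign. It assumes 32768 blocks per group, and the function name is illustrative; the sample string is a short excerpt of the output, not the full list.

```python
import re
from collections import Counter

# A sign, an optional "(", a block number, an optional "--last", an optional ")".
TOKEN = re.compile(r"([+-])\(?(\d+)(?:--(\d+))?\)?")


def summarize_differences(text, blocks_per_group=32768):
    """Count blocks per (group, sign) in an e2fsck 'Block bitmap differences' list."""
    summary = Counter()
    for sign, first, last in TOKEN.findall(text):
        start = int(first)
        end = int(last) if last else start
        for block in range(start, end + 1):
            summary[(block // blocks_per_group, sign)] += 1
    return summary


sample = "+(557056--557060) +557070 -557664 +(950272--950276) -950880"
for (group, sign), count in sorted(summarize_differences(sample).items()):
    print(group, sign, count)
```

Under the 32768-blocks-per-group assumption, every block listed in the output above falls into group 17 or group 29, which is consistent with the two "Free blocks count wrong for group" messages.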



References:

[1] https://android.googlesource.com/platform/system/extras/+/c71eaf37486bed9163ad528f51de29dd56b34fd2

[2] https://android.googlesource.com/platform/system/extras/+/3f6ea671d55b0f8ba9bab8826c817327d67ee9bb
Comment 31 Theodore Tso 2017-05-14 04:14:02 UTC
So I'm just curious.  Android's fix for the problem landed in October 2016[1].  It's now May 2017.  Why is it that so many folks are still using an old version of AOSP that apparently does not have this fix?   

[1] https://android.googlesource.com/platform/system/extras/+/c71eaf37486bed9163ad528f51de29dd56b34fd2
Comment 32 Mauro Rossi 2017-05-14 11:41:43 UTC
(In reply to Theodore Tso from comment #31)
> So I'm just curious.  Android's fix for the problem landed in October
> 2016[1].  It's now May 2017.  Why is it that so many folks are still using
> an old version of AOSP that apparently does not have this fix?   
> 
> [1]
> https://android.googlesource.com/platform/system/extras/+/
> c71eaf37486bed9163ad528f51de29dd56b34fd2

The reason is that the commit is not in the Android release branches:
it landed in master, but we're building nougat-x86 from the latest tagged version, android-7.1.2_r11.

Is the mentioned problem expected to affect 32-bit only, or 64-bit too, given the explanation in the commit message?
Mauro
Comment 33 Theodore Tso 2017-05-14 14:56:19 UTC
Well, the commit seems to imply that it's only for 32-bit platforms.   I am not an expert on make_ext4fs, and I don't have the AOSP sources on my laptop, so it's not something I can easily investigate at the moment.  But it would explain a lot of things; the Android team at the time would have been only focusing on 64-bit devices, and while e2fsck and mke2fs in e2fsprogs has plenty of regression tests, and I *do* run them on 32-bit platforms from time to time for e2fsprogs, make_ext4fs.... not so much.   (As far as I know it has no regression tests.)

Which brings me to my next question I'm asking out of curiosity.   Why, in 2017, are you trying to build 32-bit x86?    Is it just to try to save RAM?   Are you trying to selflessly try to find 32-bit bugs when most device manufacturers are focusing on 64-bit architectures?   :-)

(Don't get me wrong; I do KVM kernel testing for ext4 using a 32-bit x86 platform partially because it's more RAM economical, and because as an upstream developer I am interested in sanity checking to make sure we haven't introduced any 32-bit regressions.   So there are good reasons to do it, but for me I'm primarily *looking* to find problems --- in other words, I'm knowingly asking for it.  :-)
Comment 34 Mauro Rossi 2017-05-14 18:20:46 UTC
(In reply to Theodore Tso from comment #33)
> Well, the commit seems to imply that it's only for 32-bit platforms.   I am
> not an expert on make_ext4fs, and I don't have the AOSP sources on my
> laptop, so it's not something I can easily investigate at the moment.  But
> it would explain a lot of things; the Android team at the time would have
> been only focusing on 64-bit devices, and while e2fsck and mke2fs in
> e2fsprogs has plenty of regression tests, and I *do* run them on 32-bit
> platforms from time to time for e2fsprogs, make_ext4fs.... not so much.  
> (As far as I know it has no regression tests.)
> 
> Which brings me to my next question I'm asking out of curiosity.   Why, in
> 2017, are you trying to build 32-bit x86?    Is it just to try to save RAM? 
> Are you trying to selflessly try to find 32-bit bugs when most device
> manufacturers are focusing on 64-bit architectures?   :-)

The reason for x86 builds in android-x86 is that x86_64 builds require SSE4_1 and SSE4_2 and, unlike on Linux, there is no way to avoid that.
Also, Android Media Player (Fugu) images have a 32-bit user space.

So 32-bit is still fairly widely used.

> 
> (Don't get me wrong; I do KVM kernel testing for ext4 using a 32-bit x86
> platform partially because it's more RAM economical, and because as an
> upstream developer I am interesting in sanity checking to make sure we
> haven't introduced any 32-bit regressions.   So there are good reasons to do
> it, but for me I'm primarily *looking* to find problems --- in other words,
> I'm knowingly asking for it.  :-)
