Bug 177141

Summary: btrfs v4.8 regression tests fail
Product: File System
Component: btrfs
Reporter: Bruce Dubbs (bruce.dubbs)
Assignee: Josef Bacik (josef)
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
CC: dsterba
Hardware: x86-64
OS: Linux
Kernel Version: 4.7.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Test results for convert-tests/001-ext2-basic
kernel config for hanging btrfs regression test

Description Bruce Dubbs 2016-10-09 21:47:30 UTC
Created attachment 241261
Test results for convert-tests/001-ext2-basic

I cannot get the regression tests to pass. They fail in several places; for example:

$ tar -xf btrfs-progs-v4.8.tar.xz
$ cd btrfs-progs-v4.8
$ ./configure
$ make
$ cd tests
$ su
# TEST=001\* ./convert-tests.sh    
    [TEST/conv]   001-ext2-basic
    [TEST/conv]     ext2 4k nodesize, btrfs defaults
file permission failed. Mismatched BTRFS::bd256723e3257df600725b340c52b7af EXT::af17bd5250e6bcc0333babe55219a963
test failed for case 001-ext2-basic
Comment 1 David Sterba 2016-10-10 10:25:43 UTC
Works for me here. The name of the file whose check failed should appear between the colons, and it is empty, so this could be a shell quoting problem or incorrect parsing of the ACL values.
Comment 2 Bruce Dubbs 2016-10-10 18:29:20 UTC
The shell in use is:

$ echo $BASH_VERSION
4.3.42(1)-release

I'll note that with btrfs v4.7.2 all the regression tests passed without problems, but btrfs v4.7.3 had similar problems (not reported).
Comment 3 David Sterba 2016-10-11 08:57:50 UTC
Thanks, that narrows things down a bit.
Comment 4 Bruce Dubbs 2016-11-04 21:12:49 UTC
Created attachment 243641
kernel config for hanging btrfs regression test
Comment 5 Bruce Dubbs 2016-11-04 21:14:17 UTC
I solved part of this problem: I did not have all the btrfs options configured in the kernel.

Now I get a hang during the tests.  The current output is:

...
    [TEST/conv]     ext4 64k nodesize, btrfs no-holes
    [TEST/conv]   004-ext2-backup-superblock-ranges
    [TEST/conv]   005-delete-all-rollback
    [TEST/conv]     ext4 4k nodesize, btrfs defaults

The tests have been hanging here for over 15 minutes and the system load average is 0.0. I am looking for suggestions. Is some other kernel setting required?
I have attached my kernel .config file.

I would also like to suggest that a short discussion of the regression tests be added to the INSTALL file. In particular, it should mention the needed kernel options.
Comment 6 David Sterba 2016-11-07 14:55:33 UTC
Can you please be more specific about the options? I'm not aware of anything special to set in the config, I regularly test with:

CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
CONFIG_BTRFS_ASSERT=y

There's a tests/README.md that would probably be suitable for the additional information, but at the moment I don't know what exactly to put there.
Comment 7 Bruce Dubbs 2016-11-07 16:43:13 UTC
Originally when I started this ticket I only had CONFIG_BTRFS_FS set.  Now I have:

CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_BTRFS_FS_CHECK_INTEGRITY=y
CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y
CONFIG_BTRFS_DEBUG=y
CONFIG_BTRFS_ASSERT=y

I'm still concerned about the hang at 'ext4 4k nodesize, btrfs defaults'. I'll rebuild my kernel to match your settings and post the results. I note that what I have now for the test logs is:

-rw-r--r-- 1 root root  21M Nov  4 16:08 convert-tests-results.txt
-rw-r--r-- 1 root root 354K Nov  4 16:02 fsck-tests-results.txt
-rw-r--r-- 1 root root 214K Nov  4 16:02 mkfs-tests-results.txt

I don't know if the convert-tests file size is normal or not.

About the tests/README.md file, it would be useful to mention it in the INSTALL file.  I had missed it.  Additionally, the file says:

"There are no special requirements on kernel features..."

But CONFIG_BTRFS_FS is needed at a minimum for btrfs to run at all.
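
A quick way to see which of the options discussed here are enabled is to grep the kernel config file. The following is only a sketch: config_enabled is a hypothetical helper, not something shipped with btrfs-progs, and the sample file is built from the settings quoted in this report rather than read from a running system.

```shell
#!/bin/bash
# Hypothetical helper, not part of btrfs-progs: succeed if option $2 is
# set to y (built in) or m (module) in the kernel config file $1.
config_enabled() {
    grep -Eq "^$2=(y|m)\$" "$1"
}

# Sample config built from settings quoted in this report. On a real
# system the file would typically be the .config in the kernel build tree.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_DEBUG is not set
EOF

for opt in CONFIG_BTRFS_FS CONFIG_BTRFS_FS_POSIX_ACL CONFIG_BTRFS_DEBUG; do
    if config_enabled "$sample" "$opt"; then
        echo "$opt: enabled"
    else
        echo "$opt: not set"
    fi
done
rm -f "$sample"
```

The anchored pattern keeps CONFIG_BTRFS_FS from matching longer names such as CONFIG_BTRFS_FS_POSIX_ACL, and commented-out "is not set" lines do not count as enabled.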
Comment 8 Bruce Dubbs 2016-11-08 03:11:46 UTC
After looking at the test scripts, I believe I have found the problem.

In convert-tests/005-delete-all-rollback/test.sh there are lines:

convert_test_post_check "$CHECKSUMTMP"

This function is in common.convert and needs three parameters:

   CHECKSUMTMP="$1"
   EXT_PERMTMP="$2"
   EXT_ACLTMP="$3"

convert_test_post_check calls 
  convert_test_post_check_permissions "$EXT_PERMTMP"

but "$EXT_PERMTMP" is null.

convert_test_post_check_permissions() has
  EXT_PERMTMP="$1"
  
and later
  ext_perm_file=`md5sum $EXT_PERMTMP | cut -f2 -d' '`
  
Since $EXT_PERMTMP is null and unquoted, md5sum is called with no file argument, reads standard input instead, and hangs forever.

The same problem is in the convert_test_post_check_acl() function due to $BTRFS_ACLTMP being null.

I do not see how this test can possibly complete, let alone pass.
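
The md5sum behavior can be reproduced outside the test suite. This is a minimal sketch assuming bash and coreutils md5sum; safe_md5 is a hypothetical guard written for illustration, not a function from common.convert, and stdin is redirected from /dev/null so the demonstration returns instead of hanging.

```shell
#!/bin/bash
# Hypothetical helper, not part of btrfs-progs: checksum a file, but
# refuse an empty or missing path instead of falling back to stdin.
safe_md5() {
    local path="$1"
    if [ -z "$path" ]; then
        echo "safe_md5: empty path" >&2
        return 1
    fi
    if [ ! -f "$path" ]; then
        echo "safe_md5: no such file: $path" >&2
        return 1
    fi
    # md5sum prints "<hash>  <name>"; keep only the hash field.
    md5sum "$path" | cut -f1 -d' '
}

# An empty, unquoted variable expands to no argument at all, so md5sum
# reads stdin. On a terminal it would wait for input indefinitely; with
# /dev/null it returns at once and reports the input name as "-".
EXT_PERMTMP=""
md5sum $EXT_PERMTMP < /dev/null

# The guarded version fails fast instead of waiting on stdin.
safe_md5 "$EXT_PERMTMP" || echo "refused to checksum an empty path"
```

Quoting the variable alone would also avoid the hang, because md5sum "" fails with an error rather than reading stdin, but an explicit check gives a clearer message.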
Comment 9 Bruce Dubbs 2016-11-08 04:06:36 UTC
I also found a couple of errors when running fuzz-tests.sh.

In tests/003-multi-check-unmounted/test.sh there are calls to run_mayfail, which is defined in tests/common. The comments in that file indicate that the command is run but its failure is ignored. That is not what happens. I got numerous errors like:

mayfail: returned code 134 (SIGABRT), not ignored

The run_mayfail() function has:

      if [ $ret == 139 ]; then
         _fail "mayfail: returned code 139 (SEGFAULT), not ignored"
      elif [ $ret == 134 ]; then
         _fail "mayfail: returned code 134 (SIGABRT), not ignored"
      fi

It appears that _fail should be _log. I do not know why btrfs check -s 1 "$image" returns code 134. I am simply running ./fuzz-tests.sh.
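
For reference, 134 and 139 follow the usual shell convention of reporting 128 plus the signal number when a process is killed by a signal: 128 + 6 for SIGABRT and 128 + 11 for SIGSEGV on Linux. A small sketch of decoding such a status follows; describe_status is a hypothetical helper for illustration, not part of tests/common.

```shell
#!/bin/bash
# Hypothetical helper, not part of tests/common: describe an exit status
# the way the shell encodes it (128 + N means "killed by signal N").
describe_status() {
    local ret="$1"
    local sig
    if [ "$ret" -gt 128 ] && sig=$(kill -l $((ret - 128)) 2>/dev/null); then
        echo "killed by SIG$sig"
    else
        echo "exited with status $ret"
    fi
}

describe_status 134
describe_status 139
describe_status 1

# A child shell that aborts itself, to show where such a status comes from.
bash -c 'kill -ABRT $$'
describe_status $?
```

This only explains what the two codes mean; whether run_mayfail should tolerate them is a separate question for the test suite.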

I also got a major error in  [TEST/fuzz]   007-simple-super-recover

    [TEST/fuzz]   007-simple-super-recover
*** Error in `/tmp/btrfs/btrfs-progs-v4.8.2/btrfs': double free or corruption (!prev): 0x0000000001e9b010 ***
======= Backtrace: =========
/lib/libc.so.6(+0x70c4b)[0x7f5b00c35c4b]
/lib/libc.so.6(+0x76fe6)[0x7f5b00c3bfe6]
/lib/libc.so.6(+0x777de)[0x7f5b00c3c7de]
/tmp/btrfs/btrfs-progs-v4.8.2/btrfs(btrfs_close_devices+0xf2)[0x45a5d4]
/tmp/btrfs/btrfs-progs-v4.8.2/btrfs(btrfs_recover_superblocks+0x37a)[0x439e6c]
/tmp/btrfs/btrfs-progs-v4.8.2/btrfs[0x43450c]
/tmp/btrfs/btrfs-progs-v4.8.2/btrfs(handle_command_group+0x44)[0x40a812]
/tmp/btrfs/btrfs-progs-v4.8.2/btrfs(cmd_rescue+0x15)[0x434980]
/tmp/btrfs/btrfs-progs-v4.8.2/btrfs(main+0x86)[0x40a8b6]
/lib/libc.so.6(__libc_start_main+0xf1)[0x7f5b00be5291]
/tmp/btrfs/btrfs-progs-v4.8.2/btrfs(_start+0x2a)[0x40a57a]

I am using gcc-6.2.0 and libc-2.24.  Could one of these be causing the test failures?
Comment 10 David Sterba 2016-11-10 18:41:26 UTC
I'm updating the docs to be more interlinked. The tests' README is now referenced from README, as INSTALL is IMO just for the userspace tools. The minimal config requirements will also be documented.

Great that you found why the 005 ext4 test hangs. I'm going to fix that. As the convert tests run for a very long time, I must have missed that 005 does not finish at all.

The fuzz tests are known to be failing for now. The code changes required to fix them are more invasive and spread all over the codebase.

The _fail in run_mayfail is intentional: the mayfail logic covers only failures during normal operation of the utilities. Such failures should all be handled properly, so an abort or segfault is not ignored in this case and stops the test. I'll update the comments to reflect that.

This turned out to be a very productive bug report, thanks!