Bug 201453 - Bug 1640090 - [xfstests xfs/490]: xfs_db print a bad (negative number) as agi freecount
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: All Linux
Importance: P1 normal
Assignee: FileSystem/XFS Default Virtual Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-17 11:20 UTC by Zorro Lang
Modified: 2018-10-24 20:10 UTC

See Also:
Kernel Version: v4.18
Subsystem:
Regression: No
Bisected commit-id:


Attachments
xfs metadump (97.43 KB, application/gzip)
2018-10-17 17:49 UTC, Zorro Lang

Description Zorro Lang 2018-10-17 11:20:42 UTC
Description of problem:
On s390x, I hit an xfs/490 failure (I can't reproduce it on x86_64). By debugging manually, I found:

# mkfs.xfs -f -m finobt=0 /dev/loop1                                             
meta-data=/dev/loop1             isize=512    agcount=4, agsize=786496 blks                               
         =                       sectsz=512   attr=2, projid32bit=1                                       
         =                       crc=1        finobt=0, sparse=1, rmapbt=0                                
         =                       reflink=1
data     =                       bsize=4096   blocks=3145984, imaxpct=25                                  
         =                       sunit=0      swidth=0 blks                                               
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1                                         
log      =internal log           bsize=4096   blocks=2560, version=2                                      
         =                       sectsz=512   sunit=0 blks, lazy-count=1                                  
realtime =none                   extsz=4096   blocks=0, rtextents=0                                       
# mount /dev/loop1 /mnt/testarea/scratch/                                        
# mkdir /mnt/testarea/scratch/dir                                                
# xfs_io -fc "pwrite 0 4096" -c fsync /mnt/testarea/scratch/dir/testfile         
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)                                  
# stat -c %i /mnt/testarea/scratch/dir/testfile                                  
132
# umount /dev/loop1                                                              
# _scratch_xfs_db -c "convert inode 132 agno"                                    
0x0 (0)
# _scratch_xfs_get_metadata_field "recs[1].freecount" "agi 0" "addr root"        
-197
# xfs_db -c "agi 0" -c "addr root" -c "print recs[1]" /dev/loop1
recs[1] = [startino,holemask,count,freecount,free]
1:[128,0,64,-197,0xffffffffffffffe0]


Version-Release number of selected component (if applicable):
kernel 4.18
xfsprogs 4.19.0-rc0

How reproducible:
100% on s390x with loop device (at least from my testing)

Steps to Reproduce:
run xfs/490 on s390x

Actual results:
as above

Expected results:
test pass

Additional info:
I don't think it's a kernel problem; the negative number may not really be on disk, because SCRATCH_DEV can still be mounted without errors:

[root@ibm-z-110 xfstests]# mount /dev/loop1 /mnt/testarea/scratch
[root@ibm-z-110 xfstests]# dmesg|tail
[ 7289.790976] XFS (loop1): Mounting V5 Filesystem
[ 7289.796601] XFS (loop1): Ending clean mount

And it's not reproducible on x86_64:
# xfs_db -c "agi 0" -c "addr root" -c "print recs[1]" /dev/loop1
recs[1] = [startino,holemask,count,freecount,free] 
1:[128,0,64,59,0xffffffffffffffe0]
Comment 1 Zorro Lang 2018-10-17 17:49:33 UTC
Created attachment 279077
xfs metadump
Comment 2 Dave Chinner 2018-10-18 01:41:55 UTC
On Wed, Oct 17, 2018 at 11:20:42AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> # _scratch_xfs_get_metadata_field "recs[1].freecount" "agi 0" "addr root"     
> -197
> # xfs_db -c "agi 0" -c "addr root" -c "print recs[1]" /dev/loop1
> recs[1] = [startino,holemask,count,freecount,free]
> 1:[128,0,64,-197,0xffffffffffffffe0]
.....
> And it's not reproducible on x86_64:
> # xfs_db -c "agi 0" -c "addr root" -c "print recs[1]" /dev/loop1
> recs[1] = [startino,holemask,count,freecount,free] 
> 1:[128,0,64,59,0xffffffffffffffe0]

-197 = -(256 - 59)

This looks like a sign extension problem in the xfs_db code. s390 is
a big endian system, right?

Cheers,

Dave.
Comment 3 Eric Sandeen 2018-10-18 01:50:58 UTC
Yep, does this fix it?

diff --git a/db/btblock.c b/db/btblock.c
index cbd2990..5a5b061 100644
--- a/db/btblock.c
+++ b/db/btblock.c
@@ -513,7 +513,7 @@ const field_t       inobt_sprec_flds[] = {
        { "holemask", FLDT_UINT16X, OI(ROFF(ir_u.sp.ir_holemask)), C1, 0,
          TYP_NONE },
        { "count", FLDT_UINT8D, OI(ROFF(ir_u.sp.ir_count)), C1, 0, TYP_NONE },
-       { "freecount", FLDT_INT8D, OI(ROFF(ir_u.sp.ir_freecount)), C1, 0,
+       { "freecount", FLDT_UINT8D, OI(ROFF(ir_u.sp.ir_freecount)), C1, 0,
          TYP_NONE },
        { "free", FLDT_INOFREE, OI(ROFF(ir_free)), C1, 0, TYP_NONE },
        { NULL }
Comment 4 Zorro Lang 2018-10-18 06:25:29 UTC
(In reply to Eric Sandeen from comment #3)
> Yep, does this fix it?

Yes, this helps.
# xfs_db -c "agi 0" -c "addr root" -c "print recs[1]" /dev/loop1
recs[1] = [startino,holemask,count,freecount,free]
1:[64,0,64,59,0xffffffffffffffe0]


> 
> diff --git a/db/btblock.c b/db/btblock.c
> index cbd2990..5a5b061 100644
> --- a/db/btblock.c
> +++ b/db/btblock.c
> @@ -513,7 +513,7 @@ const field_t       inobt_sprec_flds[] = {
>         { "holemask", FLDT_UINT16X, OI(ROFF(ir_u.sp.ir_holemask)), C1, 0,
>           TYP_NONE },
>         { "count", FLDT_UINT8D, OI(ROFF(ir_u.sp.ir_count)), C1, 0, TYP_NONE },
> -       { "freecount", FLDT_INT8D, OI(ROFF(ir_u.sp.ir_freecount)), C1, 0,
> +       { "freecount", FLDT_UINT8D, OI(ROFF(ir_u.sp.ir_freecount)), C1, 0,
>           TYP_NONE },
>         { "free", FLDT_INOFREE, OI(ROFF(ir_free)), C1, 0, TYP_NONE },
>         { NULL }
Comment 5 Eric Sandeen 2018-10-24 20:10:39 UTC
So Zorro correctly points out that big vs. little endian certainly should not matter for this u8.

What does matter is the signed type, because getbitval is doing tricks to try to handle sign extension and it does it differently for big vs. little endian:

                if (getbit_l(p, bit + i)) {
                        /* If the last bit is on and we care about sign
                         * bits and we don't have a full 64 bit
                         * container, turn all bits on between the
                         * sign bit and the most sig bit.
                         */

                        /* handle endian swap here */
#if __BYTE_ORDER == LITTLE_ENDIAN
                        if (i == 0 && signext && nbits < 64)
                                rval = (~0ULL) << nbits;
                        rval |= 1ULL << (nbits - i - 1);
#else
                        if ((i == (nbits - 1)) && signext && nbits < 64)
                                rval |= ((~0ULL) << nbits);
                        rval |= 1ULL << (nbits - i - 1);
#endif

Switching it to FLDT_UINT8D makes "signext" false so none of this happens, but that's papering over the underlying bug with signed types.

The bug seems to be the test for if ((i == (nbits - 1)) ...) - this is testing the last / rightmost bit in the number, which is /not/ the MSB.

But I cannot seem to wrap my head around the right way to fix it, yet.
