Bug 199707
Summary: | kernel oops with "BTRFS: decompress failed" in 2 NoCOW files | ||
---|---|---|---|
Product: | File System | Reporter: | jamespharvey20 |
Component: | btrfs | Assignee: | BTRFS virtual assignee (fs_btrfs) |
Status: | RESOLVED OBSOLETE | ||
Severity: | blocking | CC: | dsterba |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.16.8 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Crash with BTRFS decompress failed; Crash with general protection fault; filefrag full output; Crash with BTRFS decompress failed | ||
Description
jamespharvey20
2018-05-13 11:42:40 UTC
Created attachment 275949 [details]
Crash with BTRFS decompress failed
Created attachment 275951 [details]
Crash with general protection fault
Additional details that seem most pertinent.

There are 2 files that trigger a crash:

-rw-r-----+ 1 root 190 16777216 Oct 1 2016 system@00fa3c0596e64d2e84096520ca46f008-0000000000000001-00053cd2c1756577.journal
-rw-r-----+ 1 root 190 8388608 Oct 1 2016 user-1000@b70add0ef010457d933fec23a2afa48a-0000000000000495-00053b6b6e65e9cf.journal

lsattr shows:

---------------C-- system@00fa3c0596e64d2e84096520ca46f008-0000000000000001-00053cd2c1756577.journal
---------------C-- user-1000@b70add0ef010457d933fec23a2afa48a-0000000000000495-00053b6b6e65e9cf.journal

(Focusing here on 1 of these files.)

filefrag -v user-1000@b70add0ef010457d933fec23a2afa48a-0000000000000495-00053b6b6e65e9cf.journal
(Full output is attached)
...
59 extents found

For EACH of the 59 extents:
* btrfs-map-logical -l [FILEFRAG'S STARTING PHYSICAL OFFSET NUMBER * 4096 FOR BLOCKSIZE] -b 4096 -o frag[FRAG NUM].1 -c 1 /dev/lvm/newMain1
* btrfs-map-logical -l [FILEFRAG'S STARTING PHYSICAL OFFSET NUMBER * 4096 FOR BLOCKSIZE] -b 4096 -o frag[FRAG NUM].2 -c 2 /dev/lvm/newMain1
* diff --brief frag[FRAG NUM].1 frag[FRAG NUM].2
(Except on the last extent, where I used "-b 847872")

If all of these matched, there could still be compression corruption, but it would have had to happen before the data was written to disk. If some or all of these didn't match, one of the disk copies got corrupted.

It turns out fragments [0-27], [29-39], and [56-68] match. But fragments 28 and [40-55] are completely different.

Notably, btrfs-map-logical isn't crashing, because it returns the data in its compressed form, so it isn't tripping up on invalid compressed data.

Regarding reading 4096 bytes for each fragment: journald files start with ASCII "LPKSHHRH". I dumped the first fragment, and found it had an extra 9 byte header before LPKSHHRH, of "3a0c 0000 6b02 0000 0a". I'm assuming that's a btrfs-lzo header. After LPKSHHRH is about 2k of binary data, and zeros.
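A minimal sketch of one way to read those 9 bytes. It assumes the bytes are listed in on-disk order, and that the extent follows the btrfs LZO layout: a little-endian 32-bit total compressed length, then a little-endian 32-bit length for the first segment. Under that assumption the trailing "0a" would be the first byte of the LZO stream itself. The function name is hypothetical and this is not a btrfs tool.

```python
import struct

def decode_btrfs_lzo_prefix(raw: bytes):
    """Split a dumped prefix into (total_len, first_segment_len, rest).

    Assumption: raw starts with two little-endian u32 values, as in the
    btrfs LZO on-disk layout (total length, then first segment length).
    """
    total_len, first_segment_len = struct.unpack_from("<II", raw, 0)
    return total_len, first_segment_len, raw[8:]

# The 9 bytes quoted in the report; bytes.fromhex ignores the spaces.
prefix = bytes.fromhex("3a0c 0000 6b02 0000 0a")
total_len, first_segment_len, rest = decode_btrfs_lzo_prefix(prefix)
print(total_len, first_segment_len, rest.hex())  # prints: 3130 619 0a
```

If that reading is right, the whole compressed extent would be 3130 bytes, which would fit inside the 4096 bytes read for this fragment.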
If I split out the first 128k of the uncompressed valid file and run lzop on it, it winds up at about 2k, so this compression ratio is realistic.

filefrag doesn't seem to be aware of compression, and shows the ending offsets and lengths based on uncompressed size. Not knowing each fragment's actual on-disk size, I decided to just grab the first 4k of each, expecting them all to be compressed within that space, except for the last one, which was much larger, so I read its whole size. This means there could be a 128k fragment taking more than 4k of disk space, so there could be more differences between the mirrored copies than I've discovered.

Created attachment 275953 [details]
filefrag full output
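The per-extent comparison from the description could be scripted roughly as follows. This is only a sketch that builds the same three commands for one extent. The btrfs-map-logical flags and the device path are copied from the report; the function name and the offset used in the example are hypothetical, and nothing is executed here.

```python
import shlex

BLOCK_SIZE = 4096  # block size used in the report to scale filefrag offsets

def compare_commands(frag_num, physical_start_blocks, device, read_bytes=BLOCK_SIZE):
    """Return the two btrfs-map-logical commands and the diff command
    for one extent, as argument lists suitable for subprocess.run."""
    logical = physical_start_blocks * BLOCK_SIZE
    cmds = []
    for copy in (1, 2):  # one dump per mirrored copy
        cmds.append([
            "btrfs-map-logical",
            "-l", str(logical),
            "-b", str(read_bytes),
            "-o", f"frag{frag_num}.{copy}",
            "-c", str(copy),
            device,
        ])
    cmds.append(["diff", "--brief", f"frag{frag_num}.1", f"frag{frag_num}.2"])
    return cmds

# Hypothetical starting offset (in blocks) for fragment 28; real values
# would come from the filefrag -v output.
for cmd in compare_commands(28, 1000, "/dev/lvm/newMain1"):
    print(shlex.join(cmd))
```

For the last extent, passing read_bytes=847872 would reproduce the larger read mentioned in the description.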
Probably other related cases:

https://www.spinics.net/lists/linux-btrfs/msg60025.html
User also sometimes had "Fixing recursive fault but reboot is needed!" style crashes (which also appear in my general protection fault crashes), and sometimes "BTRFS: decompress failed". No resolution.

https://www.spinics.net/lists/linux-btrfs/msg52218.html
User also has "BUG: unable to handle kernel paging request" style crashes (which is the first line in my BTRFS decompress failed crashes). No resolution.

Created attachment 275961 [details]
Crash with BTRFS decompress failed
Adding this because it's different from the other BTRFS decompress failed crash I attached. This one has "BTRFS: decompress failed" as the first line, followed by "BUG: unable to handle kernel NULL pointer dereference at 0000000000000001".
(While testing all of my no-checksum files for inconsistencies, mounting degraded to get access to mirrored copies that weren't being read with all disks present, I ran across another corrupt journald file on disk1.)
This is a semi-automated bugzilla cleanup, report is against an old kernel version. If the problem still happens, please open a new bug. Thanks.