Bug 87021

Summary: Unrecoverable btrfs corruption with 3.17
Product: File System
Reporter: kernel-bugzilla
Component: btrfs
Assignee: Josef Bacik (josef)
Status: RESOLVED CODE_FIX
Severity: high
CC: dpisklov, dsterba, hugo, isntall.us, szg00000
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.17.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: signature.asc

Description kernel-bugzilla 2014-10-27 22:22:19 UTC
I've hit a bug that caused my btrfs root filesystem to become unrecoverably corrupt twice now, both times shortly after upgrading to 3.17. Both times, btrfs complained about an I/O error in dmesg (unfortunately, I can't provide a dump of dmesg, as the copy I made is a 0-byte file for reasons I don't yet understand, probably related to it being copied from a broken system with a read-only root fs). A recent btrfsck from btrfs-tools 3.17 can neither check nor repair it:

$ btrfsck image.img 
Check tree block failed, want=26604240896, have=0
Check tree block failed, want=26604240896, have=0
Check tree block failed, want=26604240896, have=0
read block failed check_tree_block
Couldn't read tree root
Couldn't open file system

btrfs-image fails with exactly the same error messages:

$ btrfs-image image.img  /tmp/test.foo
Check tree block failed, want=26604240896, have=0
Check tree block failed, want=26604240896, have=0
Check tree block failed, want=26604240896, have=0
read block failed check_tree_block
Couldn't read tree root
Open ctree failed
create failed (Success)

and of course, the FS can't be mounted either, dmesg:

BTRFS: bad tree block start 0 26604240896
BTRFS: failed to read tree root on sda2
BTRFS: open_ctree failed
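
For readers following the messages above: the want/have pair compares the logical address the tool expected with the block-number field stored in the tree block's own header. A minimal sketch of that comparison, assuming the btrfs on-disk header begins with a 32-byte checksum and a 16-byte fsid followed by a little-endian 64-bit bytenr (so offset 48). The names header_bytenr and check_block are hypothetical and not part of btrfs-progs, and the real tools verify more than this one field.

```python
import struct

# Assumed btrfs header layout: csum[32], fsid[16], then __le64 bytenr.
BYTENR_OFFSET = 32 + 16


def header_bytenr(block: bytes) -> int:
    """Return the bytenr field stored in a raw tree block header."""
    (bytenr,) = struct.unpack_from("<Q", block, BYTENR_OFFSET)
    return bytenr


def check_block(block: bytes, want: int) -> str:
    """Compare the stored bytenr with the expected logical address."""
    have = header_bytenr(block)
    if have == want:
        return "ok"
    return f"Check tree block failed, want={want}, have={have}"


# A block that reads back as all zero bytes reproduces the message in this report.
zeroed = bytes(4096)
print(check_block(zeroed, 26604240896))
```

Under these assumptions, a tree block whose contents read back as zeros yields have=0, which matches the output quoted above.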
Comment 1 kernel-bugzilla 2014-10-29 14:38:55 UTC
Some additional info that might be helpful and I forgot to mention:
 - The filesystem was mounted with -o discard,compress=lzo
 - I did daily read-only snapshots and had - at the time of the crash - about 10 of them
Comment 2 Hugo Mills 2014-11-16 15:12:16 UTC
Can you give more information about the configuration of the filesystem?

Specifically:

 - What RAID configuration did you have on the FS, if any?
 - Is this on an SSD? (I assume so, given you're using discard)
 - What SSD hardware?
 - What SATA controller hardware?
Comment 3 kernel-bugzilla 2014-11-17 09:46:49 UTC
- RAID: I didn't use any btrfs RAID features. Also, the filesystem was located directly on a physical partition, no device-mapper.
- SSD: Yes, it's a Samsung 840 Evo 250GB.
- AHCI controller: 00:1f.2 SATA controller: Intel Corporation 8 Series SATA Controller 1 [AHCI mode] (rev 04)
Comment 4 dpisklov 2014-11-17 13:06:24 UTC
Have a look here: I had exactly the same issue (BTW also on a Samsung SSD), and this bug has a solution in the comments:
https://bugzilla.kernel.org/show_bug.cgi?id=72151 (I marked my bug as a duplicate of that one; I suggest you do the same)
The solution is here:
http://www.spinics.net/lists/linux-btrfs/msg36714.html
Comment 5 Hugo Mills 2014-11-17 13:36:27 UTC
Created attachment 157851 [details]
signature.asc

On Mon, Nov 17, 2014 at 01:06:24PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #4 from dpisklov@gmail.com ---
> Have a look here, I had exactly same issue (BTW also on Samsung's SSD), and
> this bug has solution in comments:
> https://bugzilla.kernel.org/show_bug.cgi?id=72151 (I marked my bug as
> duplicate of that one, suggest you do the same)

   No, this is *not* the same bug. Please take the duplicate marker
off.

   The important thing in this bug and the one you erroneously marked
as a duplicate is the "have=0". That implies that the metadata is
being zeroed, rather than overwritten by random junk, which is what's
happened with your FS. We suspect that trim is involved in both cases,
but that's clearly not the case with your bug.

   Hugo.

> Solution is here:
> http://www.spinics.net/lists/linux-btrfs/msg36714.html
>
Comment 6 dpisklov 2014-11-17 13:40:46 UTC
(In reply to Hugo Mills from comment #5)
>    No, this is *not* the same bug. Please take the duplicate marker
> off.
> 
>    The important thing in this bug and the one you erroneously marked
> as a duplicate is the "have=0". That implies that the metadata is
> being zeroed, rather than overwritten by random junk, which is what's
> happened with your FS. We suspect that trim is involved in both cases,
> but that's clearly not the case with your bug.

The bug I refer to also has have=18446744073709551615 (and not 0 as in my case); however, the same fix helped me. So this particular bug is indeed the same as 72151... And I guess the difference in my case is that I use discard instead of trim.
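
The two "have" values discussed in this thread can be told apart mechanically: 0 is a field of all-zero bytes, and 18446744073709551615 equals 2^64 - 1, a 64-bit field with every bit set. A minimal sketch that labels log lines accordingly, assuming the "want=..., have=..." format shown in the btrfsck output earlier in this report. The function name classify_have is hypothetical, and the labels describe only the bit pattern, not the cause.

```python
import re

ALL_ONES_U64 = 2**64 - 1  # 18446744073709551615, every bit set

# Matches the "want=<n>, have=<n>" pair printed by the tools in this report.
PATTERN = re.compile(r"want=(\d+),?\s+have=(\d+)")


def classify_have(line: str):
    """Label the 'have' value on a log line, or return None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    have = int(match.group(2))
    if have == 0:
        return "zeroed"
    if have == ALL_ONES_U64:
        return "all-ones"
    return "other"


print(classify_have("Check tree block failed, want=26604240896, have=0"))
```

Whether either pattern points to trim or discard is the open question in this thread; the sketch only separates the two cases so reports can be grouped.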