Bug 87061

Summary: BTRFS disk corrupted after upgrade to kernel 3.17.1; the disk is now unmountable
Product: File System
Reporter: isntall.us
Component: btrfs
Assignee: Josef Bacik (josef)
Status: RESOLVED OBSOLETE
Severity: high
CC: david.gabriel, dsterba, hugo, isntall.us, kernel-bugzilla, leho, szg00000
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 3.17.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: btrfs-show-super; btrfs show super with -afF

Description isntall.us 2014-10-28 04:04:45 UTC
Created attachment 155581
btrfs-show-super

I recently upgraded to Linux kernel 3.17.1 and was running snapper to take hourly snapshots of my disk. It took only a couple of hours before the disk became read-only. Once I realized there was an issue, I shut down the instance to try to stop any further degradation.



Here is some output from btrfs-progs built from the git repo:



[root@lep mnt]# /home/user/git-repos/btrfs-progs/btrfs check /dev/sde3
Check tree block failed, want=1163613192192, have=0
Check tree block failed, want=1163613192192, have=0
Check tree block failed, want=1163613192192, have=0
read block failed check_tree_block
Couldn't read tree root
Couldn't open file system



[root@lep mnt]# mount -o recovery,ro /dev/sde3 /mnt/btrfs-mount/
mount: wrong fs type, bad option, bad superblock on /dev/sde3,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
[root@lep mnt]#  dmesg | tail
[247224.534827] BTRFS: bad tree block start 0 1163613192192
[247224.534833] BTRFS: failed to read tree root on sde3
[247224.543994] parent transid verify failed on 1163608866816 wanted 176180 found 176182
[247224.543999] BTRFS: failed to read tree root on sde3
[247224.547683] BTRFS: bad tree block start 0 1163608342528
[247224.547697] BTRFS: failed to read tree root on sde3
[247224.556026] BTRFS: bad tree block start 0 1163606458368
[247224.556041] BTRFS: failed to read tree root on sde3
[247224.612869] BTRFS: open_ctree failed
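
For anyone triaging similar logs, the dmesg lines above can be summarized with a short script. This is only a sketch: the regular expressions assume the message wording shown in this report (kernel 3.17.1), the name summarize is made up for illustration, and the two numbers after "bad tree block start" are read as the start found on disk followed by the expected address, which is consistent with the want=/have= output of btrfs check above.

```python
import re

# Lines copied from the dmesg output in this report.
DMESG = """\
[247224.534827] BTRFS: bad tree block start 0 1163613192192
[247224.534833] BTRFS: failed to read tree root on sde3
[247224.543994] parent transid verify failed on 1163608866816 wanted 176180 found 176182
[247224.543999] BTRFS: failed to read tree root on sde3
[247224.547683] BTRFS: bad tree block start 0 1163608342528
[247224.547697] BTRFS: failed to read tree root on sde3
[247224.556026] BTRFS: bad tree block start 0 1163606458368
[247224.556041] BTRFS: failed to read tree root on sde3
[247224.612869] BTRFS: open_ctree failed
"""

# Assumption: message formats exactly as they appear in the log above.
BAD_START = re.compile(r"bad tree block start (\d+) (\d+)")
TRANSID = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def summarize(log):
    """Return (expected block addresses, transid mismatches) found in a log."""
    bad_blocks = []
    transid_errors = []
    for line in log.splitlines():
        match = BAD_START.search(line)
        if match:
            # Second number is taken to be the expected block address.
            bad_blocks.append(int(match.group(2)))
            continue
        match = TRANSID.search(line)
        if match:
            # Tuple of (block address, wanted generation, found generation).
            transid_errors.append(tuple(int(g) for g in match.groups()))
    return bad_blocks, transid_errors


if __name__ == "__main__":
    blocks, mismatches = summarize(DMESG)
    for address in blocks:
        print(f"bad tree block at {address}")
    for address, wanted, found in mismatches:
        print(f"block {address}: generation off by {found - wanted}")
```

Run against this log, it reports three blocks that read back with a start of 0 and one block at 1163608866816 whose generation (176182) is two ahead of the wanted one (176180).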



[root@lep mnt]# /home/ser/git-repos/btrfs-progs/btrfs-find-root  /dev/sde3
Super think's the tree root is at 1163613192192, chunk root 1177076924416



My issue seems very similar to this issue: https://bugzilla.kernel.org/show_bug.cgi?id=87021

P.S. After I realized there was an issue, I made a block-level disk image with dd.
Comment 1 isntall.us 2014-10-28 04:10:32 UTC
Created attachment 155591
btrfs show super with -afF
Comment 2 David Gabriel 2014-10-28 19:31:32 UTC
I might have hit the same bug too. I was running Arch Linux with 3.17.1-1-ARCH and also used snapper. When my disk space was running low (~90% full), I attempted a rebalance, which segfaulted. After the segfault the btrfs device hung and I had to reset the machine, as the btrfs process was in state 'D' and the computer wouldn't let me do a regular shutdown. Afterwards the file system was corrupted and I could neither restore the data nor mount it. Neither btrfs-zero-log nor 'btrfs recover' worked.

I managed to record the dmesg output before I reset the computer:

~ # btrfs balance start -dusage=85 /home/
Segmentation fault

~ # btrfs balance start -dusage=85 /home/
ERROR: error during balancing '/home/' - Operation now in progress # ??

~ # btrfs balance status -v /home
Balance on '/home' is running
0 out of about 22 chunks balanced (1 considered), 100% left
Dumping filters: flags 0x1, state 0x1, force is off
  DATA (flags 0x2): balancing, usage=85
                                                                                                                                                                         
~ # btrfs balance cancel /home
# comment: I/O wait, process in D state, /home hangs. After recording the stack trace I had to reboot.

### stacktrace caused by btrfs rebalance segfault ###
Segfault:
[77589.917723] BTRFS: could not find root 8
[78380.602881] BTRFS info (device sda3): qgroup scan completed
[463110.059406] BTRFS info (device sda3): relocating block group 100952702976 flags 1
[463117.474700] BTRFS info (device sda3): found 4614 extents
[463121.953892] ------------[ cut here ]------------
[463121.953896] kernel BUG at fs/btrfs/relocation.c:931!
[463121.953897] invalid opcode: 0000 [#1] PREEMPT SMP 
[463121.953899] Modules linked in: btrfs xor raid6_pq pci_stub vboxpci(O) vboxnetflt(O) vboxnetadp(O) nfsv3 nfs_acl arc4 ecb md4 md5 hmac nls_utf8 cifs dns_resolver cfg80211 rfkill w83627ehf hwmon_vid ext4 crc16 mbcache jbd2 snd_hda_codec_analog snd_hda_codec_generic snd_hda_codec_hdmi nvidia(PO) snd_usb_audio snd_usbmidi_lib joydev snd_rawmidi mousedev snd_seq_device iTCO_wdt iTCO_vendor_support gpio_ich evdev mxm_wmi mac_hid coretemp hwmon kvm_intel kvm psmouse serio_raw snd_hda_intel snd_hda_controller i2c_i801 snd_hda_codec r8169 snd_hwdep mii snd_pcm lpc_ich snd_timer snd i7core_edac soundcore edac_core wmi shpchp acpi_cpufreq processor button vboxvideo(O) drm i2c_core vboxdrv(O) usb_storage fuse nfs lockd sunrpc fscache sd_mod sr_mod crc_t10dif cdrom crct10dif_common ata_generic pata_acpi hid_generic
[463121.953930]  usbhid hid atkbd libps2 ahci pata_jmicron libahci libata ehci_pci uhci_hcd firewire_ohci ehci_hcd scsi_mod firewire_core crc_itu_t usbcore usb_common i8042 serio xfs crc32c_generic crc32c_intel libcrc32c
[463121.953941] CPU: 0 PID: 4292 Comm: btrfs Tainted: P          IO   3.17.1-1-ARCH #1
[463121.953942] Hardware name: System manufacturer System Product Name/Rampage II GENE, BIOS 1701    09/19/2011
[463121.953944] task: ffff8804e4713250 ti: ffff88011e004000 task.ti: ffff88011e004000
[463121.953945] RIP: 0010:[<ffffffffa12f7584>]  [<ffffffffa12f7584>] build_backref_tree+0xf94/0x1270 [btrfs]
[463121.953957] RSP: 0018:ffff88011e0078d0  EFLAGS: 00010287
[463121.953958] RAX: ffff88012505c800 RBX: ffff8803729e4d80 RCX: ffff8805226208c0
[463121.953959] RDX: ffff880522620300 RSI: 0000001736c74000 RDI: ffff88012505c800
[463121.953960] RBP: ffff88011e0079c0 R08: 0000000000017420 R09: ffff880522620300
[463121.953961] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880522620880
[463121.953962] R13: ffff8802ad22b370 R14: ffff880522620900 R15: ffff88011b157800
[463121.953963] FS:  00007f76d1a718c0(0000) GS:ffff88053fc00000(0000) knlGS:0000000000000000
[463121.953964] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[463121.953965] CR2: 000007fffffd9478 CR3: 0000000128aee000 CR4: 00000000000027f0
[463121.953966] Stack:
[463121.953967]  ffff880522620300 01ffffffa12f81a1 ffff880522620900 ffff8803729e4d80
[463121.953969]  ffff880522620900 ffff88012505c800 ffff8804dc2f0a60 ffff880500000003
[463121.953970]  ffff880522620940 ffff88011b157920 ffff88011b157924 ffff8802ad22b360
[463121.953972] Call Trace:
[463121.953979]  [<ffffffffa12f94a4>] relocate_tree_blocks+0x214/0x650 [btrfs]
[463121.953985]  [<ffffffffa12f8968>] ? add_data_references+0x278/0x2a0 [btrfs]
[463121.953991]  [<ffffffffa12fa76d>] relocate_block_group+0x26d/0x6d0 [btrfs]
[463121.953997]  [<ffffffffa12fadb6>] btrfs_relocate_block_group+0x1e6/0x2f0 [btrfs]
[463121.954003]  [<ffffffffa12cec93>] btrfs_relocate_chunk.isra.28+0x63/0x740 [btrfs]
[463121.954007]  [<ffffffffa127c9a1>] ? btrfs_set_path_blocking+0x41/0x80 [btrfs]
[463121.954011]  [<ffffffffa1281cad>] ? btrfs_search_slot+0x4dd/0xa70 [btrfs]
[463121.954018]  [<ffffffffa12bf959>] ? btrfs_get_token_64+0x119/0x140 [btrfs]
[463121.954022]  [<ffffffffa1285c74>] ? block_group_cache_tree_search+0xc4/0xf0 [btrfs]
[463121.954028]  [<ffffffffa12d21cb>] btrfs_balance+0x98b/0xf50 [btrfs]
[463121.954035]  [<ffffffffa12d9649>] btrfs_ioctl_balance+0x169/0x3c0 [btrfs]
[463121.954041]  [<ffffffffa12ded08>] btrfs_ioctl+0x588/0x28c0 [btrfs]
[463121.954044]  [<ffffffff81160937>] ? lru_cache_add_active_or_unevictable+0x27/0xa0
[463121.954048]  [<ffffffff811826f0>] ? handle_mm_fault+0xa90/0x1100
[463121.954050]  [<ffffffff811d2632>] ? final_putname+0x22/0x50
[463121.954052]  [<ffffffff811d7972>] ? user_path_at_empty+0x72/0xd0
[463121.954055]  [<ffffffff8105e9ac>] ? __do_page_fault+0x2ec/0x600
[463121.954056]  [<ffffffff811860d1>] ? vma_link+0xc1/0xd0
[463121.954058]  [<ffffffff811ab47a>] ? kmem_cache_alloc+0x16a/0x170
[463121.954060]  [<ffffffff811da020>] do_vfs_ioctl+0x2d0/0x4b0
[463121.954062]  [<ffffffff811da281>] SyS_ioctl+0x81/0xa0
[463121.954065]  [<ffffffff8153c829>] system_call_fastpath+0x16/0x1b
[463121.954066] Code: fe ff ff 48 8b 55 90 48 8d 48 10 48 8d 75 88 48 89 4d 90 48 89 70 10 48 89 50 18 48 89 0a 48 8b 00 4c 39 e8 75 dd e9 9e fe ff ff <0f> 0b 49 8d 44 24 20 48 89 9d 50 ff ff ff 4c 89 bd 70 ff ff ff
[463121.954082] RIP  [<ffffffffa12f7584>] build_backref_tree+0xf94/0x1270 [btrfs]
[463121.954088]  RSP <ffff88011e0078d0>
[463121.954089] ---[ end trace c3ca8db080989b0d ]---
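
A trace like the one above can be reduced to the BUG location and the confirmed call chain with a few lines of Python. This is a sketch, not a general oops parser: the input is an excerpt of the trace (several '?' frames are left out to keep it short), the patterns assume the 3.17-era format shown here, and parse_oops is a hypothetical helper name. Frames marked with '?' are addresses found on the stack that the unwinder could not confirm, so they are skipped.

```python
import re

# Excerpt of the oops in this comment; not the full trace.
OOPS = """\
[463121.953896] kernel BUG at fs/btrfs/relocation.c:931!
[463121.953972] Call Trace:
[463121.953979]  [<ffffffffa12f94a4>] relocate_tree_blocks+0x214/0x650 [btrfs]
[463121.953985]  [<ffffffffa12f8968>] ? add_data_references+0x278/0x2a0 [btrfs]
[463121.953991]  [<ffffffffa12fa76d>] relocate_block_group+0x26d/0x6d0 [btrfs]
[463121.953997]  [<ffffffffa12fadb6>] btrfs_relocate_block_group+0x1e6/0x2f0 [btrfs]
[463121.954003]  [<ffffffffa12cec93>] btrfs_relocate_chunk.isra.28+0x63/0x740 [btrfs]
[463121.954028]  [<ffffffffa12d21cb>] btrfs_balance+0x98b/0xf50 [btrfs]
[463121.954035]  [<ffffffffa12d9649>] btrfs_ioctl_balance+0x169/0x3c0 [btrfs]
[463121.954041]  [<ffffffffa12ded08>] btrfs_ioctl+0x588/0x28c0 [btrfs]
[463121.954060]  [<ffffffff811da020>] do_vfs_ioctl+0x2d0/0x4b0
[463121.954062]  [<ffffffff811da281>] SyS_ioctl+0x81/0xa0
"""

# Assumption: line formats exactly as they appear in the trace above.
BUG_AT = re.compile(r"kernel BUG at (\S+):(\d+)!")
FRAME = re.compile(r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_oops(text):
    """Return ((source file, line), [confirmed frame names]) from an oops."""
    bug = None
    frames = []
    for line in text.splitlines():
        match = BUG_AT.search(line)
        if match:
            bug = (match.group(1), int(match.group(2)))
            continue
        match = FRAME.search(line)
        # group(1) is set only for unconfirmed '?' frames, which are dropped.
        if match and not match.group(1):
            frames.append(match.group(2))
    return bug, frames


if __name__ == "__main__":
    bug, frames = parse_oops(OOPS)
    print("BUG at", bug)
    print(" <- ".join(frames))
```

The confirmed frames run from SyS_ioctl through btrfs_balance down to relocate_tree_blocks, which matches the balance command that triggered the crash.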
Comment 3 Hugo Mills 2014-11-16 15:18:01 UTC
Can you give more information about the configuration of the filesystem?

Specifically:

 - What RAID configuration did you have on the FS, if any?
 - Is this on an SSD? (I assume so, given you're using discard)
 - What SSD hardware?
 - What SATA controller hardware?
Comment 4 isntall.us 2014-11-17 02:10:57 UTC
It was a standalone FS with some mount options on an SSD. The SSD had 3 partitions: boot, swap, and root fs. As mentioned above, I was using kernel 3.17.1.

No RAID.

SSD: yes.
Unfortunately for me, I didn't write down the mount flags.
It should have had defaults,ssd,compress=lzo,discard,subvol=
and probably one or two more related to cache.

240 GB Intel 530 Series Solid State Drive

The SATA controller is the Intel X79 chipset, and the motherboard was the Asus P9X79 LE.
Comment 5 David Gabriel 2014-11-17 09:42:33 UTC
Same here, no RAID, just a 'standard' primary partition with btrfs on an SSD. I had been running this setup for about 14 days and did nothing fancy, only snapper taking snapshots every now and then.

Mount flags were taken straight from the Arch Linux wiki (SSD / performance recommendation):

noatime,discard,ssd,autodefrag,compress=lzo,space_cache

Hardware:
 - Samsung 840 Pro 256GB
 - MB: Asus Rampage II GENE, ICH10 SATA Controller
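
To see which mount options the two affected setups share, the option strings quoted in Comments 4 and 5 can be compared mechanically. This is a small sketch with made-up helper names (parse_opts, common_options). The Comment 4 string was given from memory and its subvol= value is not stated in the report, so it is kept empty here.

```python
def parse_opts(option_string):
    """Split a comma-separated mount option string into a {name: value} dict."""
    opts = {}
    for item in option_string.split(","):
        name, _, value = item.partition("=")
        opts[name] = value  # flag-style options such as "ssd" get an empty value
    return opts


def common_options(a, b):
    """Return the options that appear in both strings with the same value."""
    opts_a, opts_b = parse_opts(a), parse_opts(b)
    shared = []
    for name, value in opts_a.items():
        if opts_b.get(name) == value:
            shared.append(f"{name}={value}" if value else name)
    return sorted(shared)


# Option strings as quoted in this report.
COMMENT_4 = "defaults,ssd,compress=lzo,discard,subvol="
COMMENT_5 = "noatime,discard,ssd,autodefrag,compress=lzo,space_cache"

if __name__ == "__main__":
    print(common_options(COMMENT_4, COMMENT_5))
```

Both setups used compress=lzo, discard and ssd. Two reports are far too few to single out any one of these options as the cause.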
Comment 6 David Sterba 2022-10-04 08:11:12 UTC
This is a semi-automated bugzilla cleanup; the report is against an old kernel version. If the problem still happens, please open a new bug. Thanks.