Bug 59451

Summary: Bug in btrfs_search_old_slot triggered when defragmenting
Product: File System Reporter: Clemens Eisserer (linuxhippy)
Component: btrfsAssignee: liubo (bo.li.liu)
Status: RESOLVED CODE_FIX    
Severity: normal CC: bo.li.liu, dsterba, jnelson-kernel-bugzilla, josef, kunc88, vovan
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.10.0.rc5 Subsystem:
Regression: No Bisected commit-id:
Attachments: full dmesg
dmesg of 3.1rc5
fix ulist_add_merge crash
BTRFS warning at fs/btrfs/inode.c:2206 record_one_backref+0x371/0x3f0()

Description Clemens Eisserer 2013-06-08 08:54:11 UTC
This was bug 962014 ( https://bugzilla.redhat.com/show_bug.cgi?id=962014 ) reported at RedHat's Bugzilla. I thought moving it to the kernel-bugzilla would make it easier to keep track of it.

When manually defragmenting Firefox's places file (~80mb, about 1500 extents) I hit a kernel-bug with 3.9.4 as well as with 3.10.0.rc4:

[   39.762340]  [<ffffffffa01644ce>] btrfs_search_old_slot+0x56e/0x8d0 [btrfs]
[   39.762384]  [<ffffffffa01dacd1>] __resolve_indirect_refs+0x141/0x5d0 [btrfs]
[   39.762428]  [<ffffffffa019df51>] ? mark_extent_buffer_accessed+0x51/0x60 [btrfs]
[   39.762471]  [<ffffffffa017cf7c>] ? btree_read_extent_buffer_pages.constprop.56+0xcc/0x120 [btrfs]
[   39.762522]  [<ffffffffa01a4d1c>] ? read_extent_buffer+0xbc/0x110 [btrfs]
[   39.762563]  [<ffffffffa01db51c>] find_parent_nodes+0x3bc/0xf90 [btrfs]
[   39.763922]  [<ffffffffa0163b4f>] ? btrfs_search_slot+0x55f/0x970 [btrfs]
[   39.765285]  [<ffffffffa01a3c0f>] ? free_extent_buffer+0x4f/0xa0 [btrfs]
[   39.766646]  [<ffffffffa01dc189>] btrfs_find_all_roots+0x99/0x100 [btrfs]
[   39.768004]  [<ffffffffa0188b50>] ? can_nocow_odirect+0x2f0/0x2f0 [btrfs]
[   39.769370]  [<ffffffffa01dcadf>] iterate_extent_inodes+0x14f/0x320 [btrfs]
[   39.770748]  [<ffffffffa0188b50>] ? can_nocow_odirect+0x2f0/0x2f0 [btrfs]
[   39.772126]  [<ffffffffa01dcd37>] iterate_inodes_from_logical+0x87/0xa0 [btrfs]
[   39.773502]  [<ffffffffa0186a0b>] record_extent_backrefs+0x7b/0xd0 [btrfs]
[   39.774874]  [<ffffffffa0191af6>] btrfs_finish_ordered_io+0x156/0x960 [btrfs]
[   39.776246]  [<ffffffffa01924e5>] finish_ordered_fn+0x15/0x20 [btrfs]
[   39.777623]  [<ffffffffa01b10cb>] worker_loop+0x12b/0x500 [btrfs]
[   39.779002]  [<ffffffffa01b0fa0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[   39.780377]  [<ffffffff81080aa0>] kthread+0xc0/0xd0
[   39.781747]  [<ffffffff810809e0>] ? insert_kthread_work+0x40/0x40
[   39.783122]  [<ffffffff8164f7ac>] ret_from_fork+0x7c/0xb0
[   39.784455]  [<ffffffff810809e0>] ? insert_kthread_work+0x40/0x40
Comment 1 Clemens Eisserer 2013-06-08 08:54:45 UTC
Created attachment 103951 [details]
full dmesg
Comment 2 Clemens Eisserer 2013-06-14 11:54:09 UTC
Created attachment 104731 [details]
dmesg of 3.1rc5
Comment 3 Clemens Eisserer 2013-06-14 11:54:55 UTC
would it help if I somehow could provide an image of the fs which shows the mentioned behaviour?
Comment 4 Jon Nelson 2013-06-22 14:55:11 UTC
A "me too".

After this, the CPU gets "stuck" and I have to reboot.


[360491.932226] ------------[ cut here ]------------
[360491.932261] kernel BUG at
/home/abuild/rpmbuild/BUILD/kernel-desktop-3.9.6/linux-3.9/fs/btrfs/ctree.c:1144!
[360491.932312] invalid opcode: 0000 [#1] PREEMPT SMP
[360491.932344] Modules linked in: xfs nilfs2 jfs usb_storage
nls_iso8859_1 nls_cp437 vfat fat mmc_block nfsv4 auth_rpcgss nfs
fscache lockd sunrpc tun snd_usb_audio snd_usbmidi_lib snd_rawmidi
snd_seq_device fuse xt_tcpudp xt_pkttype xt_LOG xt_limit af_packet
ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT
iptable_raw xt_CT iptable_filter ip6table_mangle
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4
nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter
ip6_tables x_tables arc4 iwldvm mac80211 snd_hda_codec_hdmi
snd_hda_codec_conexant iTCO_wdt iTCO_vendor_support mperf coretemp
kvm_intel kvm snd_hda_intel snd_hda_codec microcode snd_hwdep snd_pcm
thinkpad_acpi snd_timer joydev pcspkr sr_mod snd tpm_tis iwlwifi cdrom
e1000e wmi tpm battery ac tpm_bios cfg80211 sdhci_pci soundcore ptp
sdhci snd_page_alloc i2c_i801 rfkill pps_core mmc_core lpc_ich
mfd_core tcp_westwood sg autofs4 btrfs raid6_pq zlib_deflate xor
libcrc32c sha256_generic dm_crypt dm_mod crc32_pclmul
ghash_clmulni_intel crc32c_intel aesni_intel ablk_helper cryptd lrw
aes_x86_64 xts gf128mul thermal i915 drm_kms_helper drm i2c_algo_bit
video button processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw
scsi_dh_emc scsi_dh_alua scsi_dh ata_generic ata_piix
[360491.933095] CPU 3
[360491.933110] Pid: 22166, comm: btrfs-endio-wri Not tainted
3.9.6-1.g8ead728-desktop #1 LENOVO 4239CTO/4239CTO
[360491.933161] RIP: 0010:[<ffffffffa022075b>]  [<ffffffffa022075b>]
__tree_mod_log_rewind+0x23b/0x240 [btrfs]
[360491.933225] RSP: 0018:ffff88014cad7888  EFLAGS: 00010297
[360491.933253] RAX: 0000000000000000 RBX: ffff880065705b40 RCX:
ffff88014cad7828
[360491.933289] RDX: 000000000247cc54 RSI: ffff88014949e821 RDI:
ffff8801bf916640
[360491.933326] RBP: ffff880073048490 R08: 0000000000001000 R09:
ffff88014cad7838
[360491.933362] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff8801c0556640
[360491.933398] R13: 000000000001731e R14: 000000000000003d R15:
ffff880158b76340
[360491.933435] FS:  0000000000000000(0000) GS:ffff88021e2c0000(0000)
knlGS:0000000000000000
[360491.933476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[360491.933506] CR2: 00007f0a360017f8 CR3: 000000015be03000 CR4:
00000000000407e0
[360491.933543] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[360491.933579] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[360491.933617] Process btrfs-endio-wri (pid: 22166, threadinfo
ffff88014cad6000, task ffff8801a5f8a6c0)
[360491.933662] Stack:
[360491.933674]  ffff88011b573800 ffff88010126e7f0 0000000000000001
0000160000000000
[360491.933719]  ffff880073048490 ffffffffa0228d45 6db6db6db6db6db7
ffff8801c0556640
[360491.933763]  ffff88020e764000 ffff880000000000 000000000001731e
ffff880197cb8158
[360491.933807] Call Trace:
[360491.933848]  [<ffffffffa0228d45>] btrfs_search_old_slot+0x635/0x950 [btrfs]
[360491.933909]  [<ffffffffa02a1ec6>]
__resolve_indirect_refs+0x156/0x640 [btrfs]
[360491.934044]  [<ffffffffa02a2e0c>] find_parent_nodes+0x95c/0x1050 [btrfs]
[360491.934176]  [<ffffffffa02a3592>] btrfs_find_all_roots+0x92/0x100 [btrfs]
[360491.934307]  [<ffffffffa02a401e>] iterate_extent_inodes+0x16e/0x370 [btrfs]
[360491.934440]  [<ffffffffa02a42b8>]
iterate_inodes_from_logical+0x98/0xc0 [btrfs]
[360491.934572]  [<ffffffffa024c1c8>] record_extent_backrefs+0x68/0xe0 [btrfs]
[360491.934652]  [<ffffffffa0256d80>]
btrfs_finish_ordered_io+0x150/0x990 [btrfs]
[360491.934739]  [<ffffffffa0276ef3>] worker_loop+0x153/0x560 [btrfs]
[360491.934833]  [<ffffffff810697c3>] kthread+0xb3/0xc0
[360491.934864]  [<ffffffff815dc6bc>] ret_from_fork+0x7c/0xb0
[360491.934896] DWARF2 unwinder stuck at ret_from_fork+0x7c/0xb0
[360491.934925]
[360491.934934] Leftover inexact backtrace:
[360491.934934]
[360491.934965]  [<ffffffff81069710>] ? kthread_create_on_node+0x120/0x120
[360491.934999] Code: c1 48 63 43 58 48 89 c2 48 c1 e2 05 48 8d 54 10
65 48 63 43 2c 48 89 c6 48 c1 e6 05 48 8d 74 30 65 e8 3a af 04 00 e9
b3 fe ff ff <0f> 0b 0f 0b 90 41 57 41 56 41 55 41 54 55 48 89 fd 53 48
83 ec
[360491.935188] RIP  [<ffffffffa022075b>]
__tree_mod_log_rewind+0x23b/0x240 [btrfs]
[360491.935233]  RSP <ffff88014cad7888>
[360491.946047] ---[ end trace 1475a0830dcadf9c ]---
[360491.946051] note: btrfs-endio-wri[22166] exited with preempt_count 1
Comment 5 liubo 2013-06-26 05:03:53 UTC
Created attachment 106051 [details]
fix ulist_add_merge crash
Comment 6 liubo 2013-06-26 05:04:39 UTC
Comment on attachment 106051 [details]
fix ulist_add_merge crash

Could you please try this patch?
Comment 7 Clemens Eisserer 2013-06-26 05:43:31 UTC
Sure, thanks a lot. I'll give it a try after work.
Comment 8 Ondřej Kunc 2013-06-26 11:05:59 UTC
Hi,

just "me too" on Debian 3.9 kernel, had to reboot about one hour after starting defrag, because server became totally unresponsive on network and console.

dmesg: http://pastebin.com/Ndxypkxa

I will try patch asap. And will inform you about result. I'm not sure if I will be able to reproduce it, because I think, that it was caused by some random problem, when running many concurrent tasks.

Ondrej
Comment 9 Josef Bacik 2013-07-01 12:58:16 UTC
Can you guys try this patch

[PATCH] Btrfs: hold the tree mod lock in __tree_mod_log_rewind

and verify it fixes the problem?
Comment 10 Jon Nelson 2013-07-01 18:14:39 UTC
I managed to try and - for me on 3.9.8 - it does not appear to resolve the issue.
Comment 11 Clemens Eisserer 2013-07-01 18:29:17 UTC
For me it did improve things (3.10.0.rc7) - I can now successfully execute "btrfs fi defrag places.sqlite && sync" without immediate troubles.

However I have experienced two hard lock-ups when using the autodefrag option (no caps-lock led, no mouse-pointer) - I'll use this kernel a few hours and report back when I can find the root-cause of the lock-ups.
Comment 12 Clemens Eisserer 2013-07-01 20:25:53 UTC
I can now confirm that defrag works with the patch, just autodefrag leads to lock-ups quite quickly.

Is "[PATCH] Btrfs: only do the tree_mod_log_free_eb if this is our last ref" expected to help in this case?
Comment 13 Jon Nelson 2013-07-01 20:52:17 UTC
I mis-applied the patch and - so far - defrag does not seem to lock up the machine.
Comment 14 Jon Nelson 2013-07-01 20:59:02 UTC
I spoke too soon.
After (trying to) defrag a few files.... it hung up again.
Comment 15 Josef Bacik 2013-07-02 12:41:57 UTC
Looks like I jumped the gun, there's a second case that needs to be accounted for, can you also apply

    Btrfs: only do the tree_mod_log_free_eb if this is our last ref

and re-test.
Comment 16 Jon Nelson 2013-07-02 15:43:21 UTC
With the second patch applied, I'm able to defrag files.
I did get a failure-to-unmount at shutdown, however, and this came up when defragmenting a fairly large file (5.4GB, over 10,000 extents):


2013-07-02T10:22:58.757314-05:00 linux-3y3i kernel: [  437.616176] ------------[ cut here ]------------
2013-07-02T10:22:58.757336-05:00 linux-3y3i kernel: [  437.616202] WARNING: at fs/btrfs/inode.c:2157 record_one_backref+0x3c1/0x430 [btrfs]()
2013-07-02T10:22:58.757338-05:00 linux-3y3i kernel: [  437.616203] Hardware name: 4239CTO
2013-07-02T10:22:58.757339-05:00 linux-3y3i kernel: [  437.616204] Modules linked in: fuse xt_tcpudp xt_pkttype xt_LOG xt_limit af_packet ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables arc4 iwldvm mac80211 iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_usb_audio mperf snd_pcm coretemp kvm_intel kvm microcode snd_timer snd_hwdep snd_usbmidi_lib snd_rawmidi snd_seq_device pcspkr thinkpad_acpi snd joydev i2c_i801 iwlwifi sr_mod cdrom lpc_ich sdhci_pci mfd_core sdhci cfg80211 e1000e mmc_core rfkill snd_page_alloc ptp pps_core tpm_tis tpm tpm_bios ac battery soundcore wmi tcp_westwood sg autofs4 btrfs raid6_pq zlib_deflate xor libcrc32c sha256_generic dm_crypt dm_mod ghash_clmulni_intel crc32_pclmul crc32c_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul i915 drm_kms_helper thermal drm i2c_algo_bit video button processor thermal_sys scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh ata_generic ata_piix                                                                                                                                                          
2013-07-02T10:22:58.757341-05:00 linux-3y3i kernel: [  437.616239] Pid: 725, comm: btrfs-endio-wri Not tainted 3.9.8-1.gfccf19c-desktop #3                                              
2013-07-02T10:22:58.757342-05:00 linux-3y3i kernel: [  437.616240] Call Trace:                                                                                                          
2013-07-02T10:22:58.757343-05:00 linux-3y3i kernel: [  437.616248]  [<ffffffff81004728>] dump_trace+0x88/0x300                                                                          
2013-07-02T10:22:58.757343-05:00 linux-3y3i kernel: [  437.616253]  [<ffffffff815c8e9b>] dump_stack+0x69/0x6f                                                                           
2013-07-02T10:22:58.757344-05:00 linux-3y3i kernel: [  437.616258]  [<ffffffff81046069>] warn_slowpath_common+0x79/0xc0                                                                 
2013-07-02T10:22:58.757345-05:00 linux-3y3i kernel: [  437.616267]  [<ffffffffa025edc1>] record_one_backref+0x3c1/0x430 [btrfs]                                                                                
2013-07-02T10:22:58.757346-05:00 linux-3y3i kernel: [  437.616303]  [<ffffffffa02b40e5>] iterate_extent_inodes+0x1e5/0x370 [btrfs]                                                                             
2013-07-02T10:22:58.757347-05:00 linux-3y3i kernel: [  437.616366]  [<ffffffffa02b4308>] iterate_inodes_from_logical+0x98/0xc0 [btrfs]                                                                         
2013-07-02T10:22:58.757347-05:00 linux-3y3i kernel: [  437.616425]  [<ffffffffa025c218>] record_extent_backrefs+0x68/0xe0 [btrfs]                                                                              
2013-07-02T10:22:58.757348-05:00 linux-3y3i kernel: [  437.616453]  [<ffffffffa0266dd0>] btrfs_finish_ordered_io+0x150/0x990 [btrfs]                                                                           
2013-07-02T10:22:58.757353-05:00 linux-3y3i kernel: [  437.616504]  [<ffffffffa0286f43>] worker_loop+0x153/0x560 [btrfs]                                                                                       
2013-07-02T10:22:58.758292-05:00 linux-3y3i kernel: [  437.616561]  [<ffffffff810696c3>] kthread+0xb3/0xc0                                                                                                     
2013-07-02T10:22:58.758294-05:00 linux-3y3i kernel: [  437.616566]  [<ffffffff815dca7c>] ret_from_fork+0x7c/0xb0                                                                                               
2013-07-02T10:22:58.758295-05:00 linux-3y3i kernel: [  437.616570] DWARF2 unwinder stuck at ret_from_fork+0x7c/0xb0                                                                                            
2013-07-02T10:22:58.758296-05:00 linux-3y3i kernel: [  437.616571]                                                                                                                                             
2013-07-02T10:22:58.758297-05:00 linux-3y3i kernel: [  437.616572] Leftover inexact backtrace:                                                                                                                                               
2013-07-02T10:22:58.758298-05:00 linux-3y3i kernel: [  437.616572]                                                                                                                                                                           
2013-07-02T10:22:58.758299-05:00 linux-3y3i kernel: [  437.616576]  [<ffffffff81069610>] ? kthread_create_on_node+0x120/0x120                                                                                                                
2013-07-02T10:22:58.758300-05:00 linux-3y3i kernel: [  437.616578] ---[ end trace 33ef856bcbe1968d ]---


Also, I/O was *stopped* during this time, but eventually recovered.
Comment 17 Vlad 2013-09-05 05:07:07 UTC
Created attachment 107423 [details]
BTRFS warning at fs/btrfs/inode.c:2206 record_one_backref+0x371/0x3f0()
Comment 18 Vlad 2013-09-05 05:11:12 UTC
I see about the same warning with 3.11 kernel quite often even without any heavy IO load (see the attachment https://bugzilla.kernel.org/attachment.cgi?id=107423). Although I can't observe any IO freezes. Let me know what kind of additional info is needed.
Comment 19 Josef Bacik 2013-09-09 18:14:28 UTC
It isn't fixed in 3.11, it's fixed in btrfs-next.
Comment 20 Clemens Eisserer 2013-09-09 18:57:45 UTC
sorry, I can no longer test - because I accidentially destroyed my btrfs-partition, while my backups are badly broken.
Currently struggling with btrfs restore SIGSEV when I try to get least some data back (bug 61081).
Comment 21 David Sterba 2013-09-23 13:49:40 UTC
(In reply to Josef Bacik from comment #19)
> It isn't fixed in 3.11, it's fixed in btrfs-next.

The patch mentioned in comment #15 is present in 3.11, that supposedly fixes the crash. The warning in record_one_backref still seems to happen with 3.11.1 (per bug 60524), but will be fixed in 3.11.2 by patch

"Btrfs: reset ret in record_one_backref"
Comment 22 David Sterba 2013-11-05 14:23:31 UTC
The patches are in 3.11.6