Bug 96141

Summary: Replacing missing drive causes kernel errors (cannot create duplicate filename)
Product: File System
Reporter: Philip (bugzilla)
Component: btrfs
Assignee: Josef Bacik (josef)
Status: RESOLVED CODE_FIX
Severity: normal
CC: osandov
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.19.1
Subsystem:
Regression: No
Bisected commit-id:

Description Philip 2015-04-04 15:16:17 UTC
Replacing a missing drive causes kernel errors. This is a btrfs RAID6 array with 4 drives. After removing the first drive (devid 1), btrfs fi show listed the remaining 3 drives (devid 2, 3, 4), followed by "*** Some devices missing" (as expected). A new drive was inserted and btrfs replace start 1 /dev/sdb /btrfs was issued. A kernel error (dmesg) followed, which I unfortunately did not save.
After a hard reboot, the replace operation continued but another kernel error was logged (sdb is still the new drive after the reboot):
[ 1199.367832] BTRFS: dev_replace from /dev/sdg (devid 1) to /dev/sdb finished
[ 1199.368059] ------------[ cut here ]------------
[ 1199.368064] WARNING: CPU: 0 PID: 627 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x68/0x80()
[ 1199.368065] sysfs: cannot create duplicate filename '/fs/btrfs/b0c9beef-6971-4da3-9697-6b0d3d75385e/devices/sdb'
[ 1199.368066] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel iTCO_wdt iTCO_vendor_support kvm crct10dif_pclmul crc32_pclmul i2c_i801 crc32c_intel btrfs lpc_ich mfd_core xor ghash_clmulni_intel raid6_pq ipmi_ssif mei_me mei tpm_tis tpm ipmi_si video ipmi_msghandler acpi_pad shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_thin_pool dm_persistent_data
[ 1199.368093]  libcrc32c dm_bio_prison uas usb_storage mpt2sas igb ast drm_kms_helper e1000e ttm drm raid_class dca i2c_algo_bit mvsas ptp pps_core libsas scsi_transport_sas
[ 1199.368102] CPU: 0 PID: 627 Comm: btrfs-devrepl Not tainted 3.19.1-201.fc21.x86_64 #1
[ 1199.368103] Hardware name: Supermicro X10SLM-F/X10SLM-F, BIOS 2.0 04/24/2014
[ 1199.368104]  0000000000000000 00000000eda3be59 ffff8800d888bc88 ffffffff8176d865
[ 1199.368106]  0000000000000000 ffff8800d888bce0 ffff8800d888bcc8 ffffffff8109bc0a
[ 1199.368108]  ffff8800d888bcc8 ffff8800d30b8000 ffff880119d09aa8 ffff8800d6c09960
[ 1199.368109] Call Trace:
[ 1199.368114]  [<ffffffff8176d865>] dump_stack+0x45/0x57
[ 1199.368117]  [<ffffffff8109bc0a>] warn_slowpath_common+0x8a/0xc0
[ 1199.368120]  [<ffffffff8109bc95>] warn_slowpath_fmt+0x55/0x70
[ 1199.368122]  [<ffffffff81294a08>] ? kernfs_path+0x48/0x60
[ 1199.368124]  [<ffffffff81298268>] sysfs_warn_dup+0x68/0x80
[ 1199.368126]  [<ffffffff812985ce>] sysfs_do_create_link_sd.isra.2+0xae/0xc0
[ 1199.368128]  [<ffffffff81298605>] sysfs_create_link+0x25/0x50
[ 1199.368144]  [<ffffffffa03bf3d6>] btrfs_kobj_add_device+0x76/0xc0 [btrfs]
[ 1199.368155]  [<ffffffffa0419fd9>] btrfs_dev_replace_finishing+0x449/0x600 [btrfs]
[ 1199.368157]  [<ffffffff810dd920>] ? wait_woken+0x90/0x90
[ 1199.368166]  [<ffffffffa041a630>] ? btrfs_dev_replace_status+0x100/0x100 [btrfs]
[ 1199.368174]  [<ffffffffa041a69d>] btrfs_dev_replace_kthread+0x6d/0x130 [btrfs]
[ 1199.368181]  [<ffffffffa041a630>] ? btrfs_dev_replace_status+0x100/0x100 [btrfs]
[ 1199.368183]  [<ffffffff810ba458>] kthread+0xd8/0xf0
[ 1199.368185]  [<ffffffff810ba380>] ? kthread_create_on_node+0x1b0/0x1b0
[ 1199.368188]  [<ffffffff81773f7c>] ret_from_fork+0x7c/0xb0
[ 1199.368189]  [<ffffffff810ba380>] ? kthread_create_on_node+0x1b0/0x1b0
[ 1199.368190] ---[ end trace 88a314afd101ae79 ]---

The replace finished and btrfs fi show lists the new drive as devid 1 again:
# btrfs fi show /data/
Label: 'DATA'  uuid: b0c9beef-6971-4da3-9697-6b0d3d75385e
	Total devices 4 FS bytes used 121.84GiB
	devid    1 size 2.73TiB used 62.53GiB path /dev/sdb
	devid    2 size 2.73TiB used 62.53GiB path /dev/sdc
	devid    3 size 2.73TiB used 62.53GiB path /dev/sdd
	devid    4 size 2.73TiB used 62.53GiB path /dev/sde

OS: Fedora 21 Server
Kernel: Linux 3.19.1-201.fc21.x86_64
btrfs --version: Btrfs v3.18.1
Comment 1 Philip 2015-04-04 15:34:51 UTC
I have repeated the process, this time with drive 2:
# btrfs fi show /data/
Label: 'DATA'  uuid: b0c9beef-6971-4da3-9697-6b0d3d75385e
	Total devices 4 FS bytes used 121.84GiB
	devid    1 size 2.73TiB used 62.53GiB path /dev/sdb
	devid    3 size 2.73TiB used 62.53GiB path /dev/sdd
	devid    4 size 2.73TiB used 62.53GiB path /dev/sde
	*** Some devices missing

This is what happens right after starting the replace:
[  302.336699] BTRFS: dev_replace from <missing disk> (devid 2) to /dev/sdc started
[  302.626472] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[  302.626558] IP: [<ffffffff8135dfda>] bio_add_page+0x1a/0x70
[  302.626617] PGD d8a38067 PUD d8a39067 PMD 0 
[  302.626668] Oops: 0000 [#1] SMP 
[  302.626705] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support crc32c_intel btrfs xor raid6_pq lpc_ich mfd_core ghash_clmulni_intel i2c_i801 ipmi_ssif acpi_pad video mei_me ipmi_si tpm_tis tpm ipmi_msghandler mei shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_thin_pool dm_persistent_data
[  302.627538]  libcrc32c dm_bio_prison uas usb_storage mpt2sas raid_class ast drm_kms_helper ttm igb drm dca e1000e i2c_algo_bit mvsas ptp libsas pps_core scsi_transport_sas
[  302.627727] CPU: 1 PID: 1574 Comm: btrfs Not tainted 3.19.1-201.fc21.x86_64 #1
[  302.627788] Hardware name: Supermicro X10SLM-F/X10SLM-F, BIOS 2.0 04/24/2014
[  302.627847] task: ffff8800d69b26c0 ti: ffff8800d8a44000 task.ti: ffff8800d8a44000
[  302.627909] RIP: 0010:[<ffffffff8135dfda>]  [<ffffffff8135dfda>] bio_add_page+0x1a/0x70
[  302.627982] RSP: 0018:ffff8800d8a47778  EFLAGS: 00010246
[  302.628027] RAX: ffff880119a98ba8 RBX: ffff8800d8364600 RCX: 0000000000000000
[  302.628085] RDX: 0000000000000000 RSI: ffffea000360f540 RDI: ffff880119a98ba8
[  302.628143] RBP: ffff8800d8a47778 R08: 0000000000000000 R09: ffff8800d7d9d000
[  302.628203] R10: 0000000000001000 R11: 0000000000210880 R12: ffff8800d7d9ca18
[  302.628262] R13: ffff8800d8901d80 R14: ffff8800d56f3900 R15: ffff8800d7d9c800
[  302.628322] FS:  00007fb0be607b40(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
[  302.628388] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  302.628436] CR2: 0000000000000098 CR3: 00000000d8a37000 CR4: 00000000000407e0
[  302.628496] Stack:
[  302.628516]  ffff8800d8a477f8 ffffffffa0414328 ffff8800d8a477b0 ffff8800d7d9ca20
[  302.628599]  0000000000000000 0000000000000000 ffff8800d8a477f8 ffffffff811ef321
[  302.628667]  ffff8800d69b26c0 00000000cee0a2e6 ffff8800d7d9ca40 0000000000000000
[  302.628736] Call Trace:
[  302.628794]  [<ffffffffa0414328>] scrub_add_page_to_rd_bio+0xc8/0x2b0 [btrfs]
[  302.628852]  [<ffffffff811ef321>] ? alloc_pages_current+0x91/0x110
[  302.628925]  [<ffffffffa0415c9d>] scrub_pages+0x1ed/0x270 [btrfs]
[  302.628995]  [<ffffffffa04178b6>] scrub_stripe+0x886/0x10c0 [btrfs]
[  302.629065]  [<ffffffffa041820f>] scrub_chunk.isra.19+0x11f/0x140 [btrfs]
[  302.629136]  [<ffffffffa04184a9>] scrub_enumerate_chunks+0x279/0x4f0 [btrfs]
[  302.629209]  [<ffffffffa0416fda>] ? scrub_setup_ctx.isra.18+0x23a/0x290 [btrfs]
[  302.629284]  [<ffffffffa0419e1e>] btrfs_scrub_dev+0x1be/0x570 [btrfs]
[  302.629354]  [<ffffffffa042e4ce>] btrfs_dev_replace_start+0x33e/0x3a0 [btrfs]
[  302.629429]  [<ffffffffa03f3f03>] btrfs_ioctl+0x1b93/0x2840 [btrfs]
[  302.629481]  [<ffffffff811ceea6>] ? handle_mm_fault+0x8a6/0xff0
[  302.629528]  [<ffffffff81229aca>] ? path_openat+0xaa/0x660
[  302.629575]  [<ffffffff81063bca>] ? __do_page_fault+0x21a/0x5b0
[  302.631549]  [<ffffffff8122df18>] do_vfs_ioctl+0x2f8/0x500
[  302.633509]  [<ffffffff8122e1a1>] SyS_ioctl+0x81/0xa0
[  302.635464]  [<ffffffff81774029>] system_call_fastpath+0x12/0x17
[  302.637413] Code: e5 e8 3b fd ff ff 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 41 89 d2 48 8b 57 08 48 89 f8 41 89 c8 48 89 e5 4c 8b 58 20 <48> 8b 92 98 00 00 00 48 8b ba 78 03 00 00 44 8b 8f ec 05 00 00 
[  302.641557] RIP  [<ffffffff8135dfda>] bio_add_page+0x1a/0x70
[  302.643374]  RSP <ffff8800d8a47778>
[  302.645173] CR2: 0000000000000098
[  302.647001] ---[ end trace c28c8d5beed1cb13 ]---
[  302.737456] ------------[ cut here ]------------
[  302.739491] WARNING: CPU: 1 PID: 1574 at kernel/exit.c:660 do_exit+0x5f/0xa70()
[  302.741845] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support crc32c_intel btrfs xor raid6_pq lpc_ich mfd_core ghash_clmulni_intel i2c_i801 ipmi_ssif acpi_pad video mei_me ipmi_si tpm_tis tpm ipmi_msghandler mei shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_thin_pool dm_persistent_data
[  302.755623]  libcrc32c dm_bio_prison uas usb_storage mpt2sas raid_class ast drm_kms_helper ttm igb drm dca e1000e i2c_algo_bit mvsas ptp libsas pps_core scsi_transport_sas
[  302.760024] CPU: 1 PID: 1574 Comm: btrfs Tainted: G      D        3.19.1-201.fc21.x86_64 #1
[  302.761992] Hardware name: Supermicro X10SLM-F/X10SLM-F, BIOS 2.0 04/24/2014
[  302.763961]  0000000000000000 00000000cee0a2e6 ffff8800d8a47388 ffffffff8176d865
[  302.765943]  0000000000000000 0000000000000000 ffff8800d8a473c8 ffffffff8109bc0a
[  302.767909]  0000000000000009 ffff8800d69b26c0 0000000000000009 ffff8800d8a476c8
[  302.769878] Call Trace:
[  302.771665]  [<ffffffff8176d865>] dump_stack+0x45/0x57
[  302.773455]  [<ffffffff8109bc0a>] warn_slowpath_common+0x8a/0xc0
[  302.775228]  [<ffffffff8109bd3a>] warn_slowpath_null+0x1a/0x20
[  302.776974]  [<ffffffff8109e1df>] do_exit+0x5f/0xa70
[  302.778724]  [<ffffffff8101877f>] oops_end+0x9f/0xe0
[  302.780469]  [<ffffffff810632ef>] no_context+0x13f/0x3a0
[  302.782017]  [<ffffffff8101e8fa>] ? native_sched_clock+0x2a/0x90
[  302.783563]  [<ffffffff810635cd>] __bad_area_nosemaphore+0x7d/0x210
[  302.785108]  [<ffffffff810637c7>] bad_area+0x47/0x60
[  302.786630]  [<ffffffff81063db6>] __do_page_fault+0x406/0x5b0
[  302.788141]  [<ffffffff8119e585>] ? mempool_alloc_slab+0x15/0x20
[  302.789646]  [<ffffffff81063f91>] do_page_fault+0x31/0x70
[  302.791169]  [<ffffffff81775fe8>] page_fault+0x28/0x30
[  302.792554]  [<ffffffff8135dfda>] ? bio_add_page+0x1a/0x70
[  302.793971]  [<ffffffffa03dcc05>] ? btrfs_io_bio_alloc+0x15/0x40 [btrfs]
[  302.795392]  [<ffffffffa0414328>] scrub_add_page_to_rd_bio+0xc8/0x2b0 [btrfs]
[  302.796793]  [<ffffffff811ef321>] ? alloc_pages_current+0x91/0x110
[  302.798231]  [<ffffffffa0415c9d>] scrub_pages+0x1ed/0x270 [btrfs]
[  302.799664]  [<ffffffffa04178b6>] scrub_stripe+0x886/0x10c0 [btrfs]
[  302.801107]  [<ffffffffa041820f>] scrub_chunk.isra.19+0x11f/0x140 [btrfs]
[  302.802474]  [<ffffffffa04184a9>] scrub_enumerate_chunks+0x279/0x4f0 [btrfs]
[  302.803806]  [<ffffffffa0416fda>] ? scrub_setup_ctx.isra.18+0x23a/0x290 [btrfs]
[  302.805143]  [<ffffffffa0419e1e>] btrfs_scrub_dev+0x1be/0x570 [btrfs]
[  302.806492]  [<ffffffffa042e4ce>] btrfs_dev_replace_start+0x33e/0x3a0 [btrfs]
[  302.807840]  [<ffffffffa03f3f03>] btrfs_ioctl+0x1b93/0x2840 [btrfs]
[  302.809170]  [<ffffffff811ceea6>] ? handle_mm_fault+0x8a6/0xff0
[  302.810489]  [<ffffffff81229aca>] ? path_openat+0xaa/0x660
[  302.811676]  [<ffffffff81063bca>] ? __do_page_fault+0x21a/0x5b0
[  302.812847]  [<ffffffff8122df18>] do_vfs_ioctl+0x2f8/0x500
[  302.814012]  [<ffffffff8122e1a1>] SyS_ioctl+0x81/0xa0
[  302.815178]  [<ffffffff81774029>] system_call_fastpath+0x12/0x17
[  302.816347] ---[ end trace c28c8d5beed1cb14 ]---

Last time I replaced a drive which was still online, I didn't get a kernel error (as far as I remember), so this apparently only happens when a missing drive is replaced.
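
When comparing oopses like the one above across kernels, the two header lines (fault address and faulting symbol) are the useful part. A minimal Python sketch for pulling them out of saved dmesg text follows. The function name `summarize_oops` is hypothetical, and the patterns assume the line format shown in this report (3.19 and 4.2 kernels); other kernel versions may format these lines differently.

```python
import re

# Patterns for the two header lines of the NULL pointer oops as printed
# in this report. Assumption: the log format matches these kernels.
FAULT_RE = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-f]+)"
)
IP_RE = re.compile(r"IP: \[<[0-9a-f]+>\] ([\w.]+)\+(0x[0-9a-f]+)/0x[0-9a-f]+")


def summarize_oops(dmesg_text):
    """Return (fault_address, symbol, offset) for the first oops, or None."""
    fault = FAULT_RE.search(dmesg_text)
    ip = IP_RE.search(dmesg_text)
    if fault is None or ip is None:
        return None
    return int(fault.group(1), 16), ip.group(1), int(ip.group(2), 16)


sample = (
    "[  302.626472] BUG: unable to handle kernel NULL pointer dereference"
    " at 0000000000000098\n"
    "[  302.626558] IP: [<ffffffff8135dfda>] bio_add_page+0x1a/0x70\n"
)

address, symbol, offset = summarize_oops(sample)
print(f"fault at {address:#x} in {symbol}+{offset:#x}")
```

A small fault address such as 0x98 is what a field access through a NULL structure pointer typically looks like, which is consistent with the "NULL pointer dereference" wording in the log.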
Comment 2 Philip 2015-04-04 15:43:06 UTC
> [  302.336699] BTRFS: dev_replace from <missing disk> (devid 2) to /dev/sdc
> started

Note that sdc is the new (blank) drive and not the one listed as sdc in the first comment (system was rebooted, so the name has changed).

btrfs fi show does not say which drive is being replaced, but it does list the replacement (devid 0):
# btrfs fi show /data/
Label: 'DATA'  uuid: b0c9beef-6971-4da3-9697-6b0d3d75385e
	Total devices 5 FS bytes used 121.84GiB
	devid    0 size 2.73TiB used 62.53GiB path /dev/sdc
	devid    1 size 2.73TiB used 62.53GiB path /dev/sdb
	devid    3 size 2.73TiB used 62.53GiB path /dev/sdd
	devid    4 size 2.73TiB used 62.53GiB path /dev/sde
	*** Some devices missing
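
Since btrfs fi show does not name the drive being replaced, a script has to work from the devid list. The Python sketch below parses output in the format shown above and reports the listed devids, the "Some devices missing" marker and the device at devid 0. Treating devid 0 as the replacement target is based on the observation in this comment; `parse_fi_show` is a hypothetical helper name, and the output format of other btrfs-progs versions may differ.

```python
import re

# One "devid ... path ..." line per device, as printed in this report.
DEVID_RE = re.compile(
    r"^\s*devid\s+(\d+)\s+size\s+\S+\s+used\s+\S+\s+path\s+(\S+)", re.M
)
TOTAL_RE = re.compile(r"Total devices\s+(\d+)")


def parse_fi_show(text):
    """Summarize `btrfs fi show` output for a single filesystem."""
    total = TOTAL_RE.search(text)
    devices = {int(m.group(1)): m.group(2) for m in DEVID_RE.finditer(text)}
    return {
        "total": int(total.group(1)) if total else None,
        "devices": devices,
        "missing": "Some devices missing" in text,
        # Assumption from this report: devid 0 is the replacement drive.
        "replace_target": devices.get(0),
    }


sample = (
    "Label: 'DATA'  uuid: b0c9beef-6971-4da3-9697-6b0d3d75385e\n"
    "\tTotal devices 5 FS bytes used 121.84GiB\n"
    "\tdevid    0 size 2.73TiB used 62.53GiB path /dev/sdc\n"
    "\tdevid    1 size 2.73TiB used 62.53GiB path /dev/sdb\n"
    "\tdevid    3 size 2.73TiB used 62.53GiB path /dev/sdd\n"
    "\tdevid    4 size 2.73TiB used 62.53GiB path /dev/sde\n"
    "\t*** Some devices missing\n"
)

info = parse_fi_show(sample)
print(info)
```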

However, the replace operation is stuck at 0.0%, so another (hard) reboot will be necessary for it to start:
# btrfs replace status /data/
0.0% done, 0 write errs, 0 uncorr. read errs
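
To tell a stuck replace from a slow one, the status line can be sampled a few times and compared. This is a rough Python sketch that parses only the in-progress format shown above; the helper names are hypothetical, and any other status text (for example after the replace has finished) simply yields None.

```python
import re

STATUS_RE = re.compile(
    r"([\d.]+)% done, (\d+) write errs, (\d+) uncorr\. read errs"
)


def parse_replace_status(line):
    """Return (percent, write_errs, uncorr_read_errs), or None if no match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), int(m.group(3))


def looks_stalled(status_lines):
    """True if at least two samples parsed and the percentage never changed."""
    parsed = [parse_replace_status(s) for s in status_lines]
    percents = [p[0] for p in parsed if p is not None]
    return len(percents) >= 2 and len(set(percents)) == 1


line = "0.0% done, 0 write errs, 0 uncorr. read errs"
print(parse_replace_status(line))
print(looks_stalled([line, line]))
```

How far apart the samples should be taken depends on drive size and load, so the sketch leaves the sampling itself to the caller.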

So replacing a missing drive is not possible without at least one hard reboot, and possibly two.
Comment 3 Philip 2015-04-06 18:41:01 UTC
I've tried replacing a missing drive on another system (also btrfs RAID6) with kernel 4.0.0-0.rc5 (btrfs v3.18.2). There it doesn't work even after a reboot: whenever the btrfs array is mounted (with option "degraded"), the NULL pointer dereference occurs (bio_add_page+0x11/0xa0).

Looks like replacing a failed drive is currently not possible with this kernel, so I'll probably stick with 3.19 for now.
Comment 4 Omar Sandoval 2015-05-10 02:41:12 UTC
Thanks, Philip, I was able to reproduce this and put together a fix. I'll send it out to the list in a couple of days once I've tested it a bit more and Cc you on it.
Comment 5 Philip 2015-10-31 16:33:08 UTC
It's still not working with kernel 4.2.5 (btrfs-progs v4.2.3). This is a different test system with an almost identical setup: 4 drives in a btrfs RAID6 array (sdb, sdc, sdd, sde). I removed sde (devid 4) and attached another drive while offline, then booted and mounted the array (degraded). What used to be device #4 is gone; the other drive is now called sde.

btrfs filesystem show says "Some devices missing" (it still doesn't say which one), but the ID of the missing device shows up in dmesg:
[  348.441939] BTRFS warning (device sdb): devid 4 uuid 9f02ff90-ca54-45b6-b34d-335e11b68aa3 is missing
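
Because the devid passed to btrfs replace start has to come from this warning, it can be handy to extract it automatically. A minimal Python sketch, assuming the warning text has exactly the form shown above (the function name `find_missing_devices` is hypothetical, and other kernels may word the message differently):

```python
import re

# Matches the "is missing" warning as printed by the 4.2.5 kernel above.
MISSING_RE = re.compile(
    r"BTRFS warning \(device \w+\): devid (\d+) uuid ([0-9a-f-]{36}) is missing"
)


def find_missing_devices(dmesg_text):
    """Return a list of (devid, uuid) for every 'is missing' warning."""
    return [(int(m.group(1)), m.group(2)) for m in MISSING_RE.finditer(dmesg_text)]


sample = (
    "[  348.441939] BTRFS warning (device sdb): devid 4 uuid "
    "9f02ff90-ca54-45b6-b34d-335e11b68aa3 is missing\n"
)

print(find_missing_devices(sample))
```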

Trying to replace the 4th drive still causes a kernel oops due to a NULL pointer dereference:
# btrfs replace start -B -f 4 /dev/sde /mnt/tmp/
[  457.271769] BTRFS: dev_replace from <missing disk> (devid 4) to /dev/sde started
[  458.023025] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[  458.024593] IP: [<ffffffff8127fd71>] bio_add_page+0x11/0x90
[  458.024949] PGD ba348067 PUD b9ced067 PMD 0 
[  458.025295] Oops: 0000 [#1] PREEMPT SMP 
[  458.025304] Modules linked in: cfg80211 rfkill mousedev iosf_mbi crct10dif_pclmul crc32_pclmul ppdev evdev input_leds led_class aesni_intel mac_hid psmouse serio_raw aes_x86_64 pcspkr lrw gf128mul glue_helper ablk_helper cryptd parport_pc parport battery video ac snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer button snd intel_agp acpi_cpufreq intel_gtt e1000 processor i2c_piix4 soundcore nfsd auth_rpcgss oid_registry nfs_acl sch_fq_codel nfs lockd grace sunrpc fscache ip_tables x_tables btrfs xor raid6_pq sr_mod cdrom sd_mod ata_generic pata_acpi atkbd libps2 ata_piix ahci libahci mptsas ohci_pci scsi_transport_sas ohci_hcd mptscsih ehci_pci ehci_hcd crc32c_intel mptbase usbcore usb_common libata scsi_mod i8042 serio
[  458.029065] CPU: 1 PID: 500 Comm: btrfs Not tainted 4.2.5-1-ARCH #1
[  458.030126] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  458.030900] task: ffff8800babc0dc0 ti: ffff880037870000 task.ti: ffff880037870000
[  458.031681] RIP: 0010:[<ffffffff8127fd71>]  [<ffffffff8127fd71>] bio_add_page+0x11/0x90
[  458.032479] RSP: 0018:ffff880037873748  EFLAGS: 00010246
[  458.032886] RAX: 0000000000000000 RBX: ffff880037912000 RCX: 0000000000000000
[  458.033294] RDX: 0000000000001000 RSI: ffffea0002d37440 RDI: ffff8800ba63ace8
[  458.033695] RBP: ffff880037873748 R08: ffff8800378b3800 R09: 0000000000000820
[  458.034094] R10: ffffffff8172266e R11: 0000000000000000 R12: ffff880037912218
[  458.034565] R13: ffff880037912220 R14: ffff8800ba155800 R15: ffff8800b4e14200
[  458.034958] FS:  00007f9f136528c0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
[  458.035724] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  458.036109] CR2: 0000000000000098 CR3: 00000000ba37d000 CR4: 00000000000406e0
[  458.036498] Stack:
[  458.036877]  ffff8800378737c8 ffffffffa0357682 ffff880037873780 ffff8800bbb97300
[  458.037659]  0000000000000000 0000000000000000 ffff8800378737c8 ffffffff811a83a1
[  458.038442]  ffff8800babc0dc0 000000009283167e 0000000001410000 ffff8800bbb97300
[  458.039252] Call Trace:
[  458.039653]  [<ffffffffa0357682>] scrub_add_page_to_rd_bio+0xc2/0x280 [btrfs]
[  458.040047]  [<ffffffff811a83a1>] ? alloc_pages_current+0x91/0x100
[  458.040436]  [<ffffffffa035910e>] scrub_pages+0x1de/0x260 [btrfs]
[  458.040818]  [<ffffffffa035a6fd>] scrub_stripe+0x86d/0x1010 [btrfs]
[  458.041190]  [<ffffffff81160e8c>] ? __alloc_pages_nodemask+0x17c/0x960
[  458.041561]  [<ffffffffa035c9ab>] scrub_chunk.isra.7+0x10b/0x130 [btrfs]
[  458.041929]  [<ffffffffa035cc45>] scrub_enumerate_chunks+0x275/0x4d0 [btrfs]
[  458.042290]  [<ffffffffa035c83c>] ? scrub_setup_ctx.isra.6+0x21c/0x280 [btrfs]
[  458.042992]  [<ffffffffa035d060>] btrfs_scrub_dev+0x1c0/0x530 [btrfs]
[  458.043363]  [<ffffffffa0370939>] btrfs_dev_replace_start+0x359/0x3c0 [btrfs]
[  458.043714]  [<ffffffffa0335fda>] btrfs_ioctl+0x1b5a/0x2ac0 [btrfs]
[  458.044058]  [<ffffffff81099d02>] ? finish_task_switch+0x62/0x1b0
[  458.044399]  [<ffffffff81572140>] ? __schedule+0x340/0xa00
[  458.044748]  [<ffffffff813cfac7>] ? put_device+0x17/0x20
[  458.045083]  [<ffffffffa003a4ff>] ? scsi_device_put+0x2f/0x40 [scsi_mod]
[  458.045420]  [<ffffffff811eafd2>] ? iput+0x42/0x240
[  458.045754]  [<ffffffff812099fa>] ? __blkdev_put+0x18a/0x1f0
[  458.046083]  [<ffffffff811e2b65>] do_vfs_ioctl+0x295/0x480
[  458.046404]  [<ffffffff811efa54>] ? mntput+0x24/0x40
[  458.046781]  [<ffffffff811e2dc9>] SyS_ioctl+0x79/0x90
[  458.047097]  [<ffffffff8157626e>] entry_SYSCALL_64_fastpath+0x12/0x71
[  458.047414] Code: 48 89 e5 e8 62 fd ff ff 5d c3 31 c0 c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 47 08 4c 8b 4f 20 48 89 e5 <48> 8b 80 98 00 00 00 4c 8b 90 80 03 00 00 41 8b 82 fc 06 00 00 
[  458.048396] RIP  [<ffffffff8127fd71>] bio_add_page+0x11/0x90
[  458.048702]  RSP <ffff880037873748>
[  458.048993] CR2: 0000000000000098
[  458.050511] ---[ end trace fa555451fb36112e ]---

After this, the system gets stuck trying to shut down:
suspending dev_replace for unmount

After a hard reset and mounting it again (degraded), the same NULL pointer dereference occurs:
[   81.747824] BTRFS: continuing dev_replace from <missing disk> (devid 4) to /dev/sde @0%
[   81.751096] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[   81.752324] IP: [<ffffffff8127fd71>] bio_add_page+0x11/0x90
[   81.752638] PGD b9ae4067 PUD b9aef067 PMD 0 
[   81.752941] Oops: 0000 [#1] PREEMPT SMP 
[   81.753235] Modules linked in: nfsv3 cfg80211 rfkill iosf_mbi mousedev crct10dif_pclmul ppdev crc32_pclmul aesni_intel evdev input_leds aes_x86_64 psmouse lrw gf128mul glue_helper ablk_helper cryptd pcspkr led_class mac_hid serio_raw snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd parport_pc soundcore acpi_cpufreq intel_agp intel_gtt parport battery ac i2c_piix4 video button processor e1000 sch_fq_codel nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc fscache ip_tables x_tables btrfs xor raid6_pq sr_mod cdrom sd_mod ata_generic pata_acpi atkbd libps2 mptsas ata_piix ahci libahci ehci_pci ohci_pci ohci_hcd ehci_hcd crc32c_intel libata scsi_transport_sas mptscsih scsi_mod mptbase usbcore usb_common i8042 serio
[   81.756363] CPU: 0 PID: 491 Comm: btrfs-devrepl Not tainted 4.2.5-1-ARCH #1
[   81.756698] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   81.757380] task: ffff8800b9b3e040 ti: ffff8800b485c000 task.ti: ffff8800b485c000
[   81.758102] RIP: 0010:[<ffffffff8127fd71>]  [<ffffffff8127fd71>] bio_add_page+0x11/0x90
[   81.758864] RSP: 0018:ffff8800b485f938  EFLAGS: 00010246
[   81.759270] RAX: 0000000000000000 RBX: ffff8800b9a31c00 RCX: 0000000000000000
[   81.759899] RDX: 0000000000001000 RSI: ffffea0002d20c40 RDI: ffff8800b9a85568
[   81.760860] RBP: ffff8800b485f938 R08: ffffffff812803af R09: 0000000000000820
[   81.761250] R10: ffffffff8172266e R11: 0000000000000000 R12: ffff8800b9a31e18
[   81.761644] R13: ffff8800b9a31e20 R14: ffff8800b9a86980 R15: ffff8800b9bdda00
[   81.762036] FS:  0000000000000000(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
[   81.762811] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   81.763201] CR2: 0000000000000098 CR3: 00000000b9ade000 CR4: 00000000000406f0
[   81.763612] Stack:
[   81.764009]  ffff8800b485f9b8 ffffffffa0273682 ffff8800b485f970 ffff8800b9ec2cc0
[   81.764800]  0000000000000000 0000000000000000 ffff8800b485f9b8 ffffffff811a83a1
[   81.765604]  ffff8800b9b3e040 000000004373c4d1 0000000001410000 ffff8800b9ec2cc0
[   81.766396] Call Trace:
[   81.766937]  [<ffffffffa0273682>] scrub_add_page_to_rd_bio+0xc2/0x280 [btrfs]
[   81.767353]  [<ffffffff811a83a1>] ? alloc_pages_current+0x91/0x100
[   81.767752]  [<ffffffffa027510e>] scrub_pages+0x1de/0x260 [btrfs]
[   81.768137]  [<ffffffffa02766fd>] scrub_stripe+0x86d/0x1010 [btrfs]
[   81.768512]  [<ffffffff81160e8c>] ? __alloc_pages_nodemask+0x17c/0x960
[   81.768885]  [<ffffffffa02789ab>] scrub_chunk.isra.7+0x10b/0x130 [btrfs]
[   81.769254]  [<ffffffffa0278c45>] scrub_enumerate_chunks+0x275/0x4d0 [btrfs]
[   81.769617]  [<ffffffffa027883c>] ? scrub_setup_ctx.isra.6+0x21c/0x280 [btrfs]
[   81.770309]  [<ffffffffa0279060>] btrfs_scrub_dev+0x1c0/0x530 [btrfs]
[   81.770661]  [<ffffffff8156feaf>] ? printk+0x55/0x6b
[   81.771011]  [<ffffffffa028ca90>] ? btrfs_dev_replace_status+0xf0/0xf0 [btrfs]
[   81.771754]  [<ffffffffa028caf3>] btrfs_dev_replace_kthread+0x63/0x130 [btrfs]
[   81.772449]  [<ffffffff81092578>] kthread+0xd8/0xf0
[   81.772803]  [<ffffffff810924a0>] ? kthread_worker_fn+0x170/0x170
[   81.773159]  [<ffffffff8157665f>] ret_from_fork+0x3f/0x70
[   81.773508]  [<ffffffff810924a0>] ? kthread_worker_fn+0x170/0x170
[   81.773866] Code: 48 89 e5 e8 62 fd ff ff 5d c3 31 c0 c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 47 08 4c 8b 4f 20 48 89 e5 <48> 8b 80 98 00 00 00 4c 8b 90 80 03 00 00 41 8b 82 fc 06 00 00 
[   81.775001] RIP  [<ffffffff8127fd71>] bio_add_page+0x11/0x90
[   81.775360]  RSP <ffff8800b485f938>
[   81.775705] CR2: 0000000000000098
[   81.776030] ---[ end trace 980a9771c0c1dfdd ]---

So replacing a missing drive in a raid6 array is still impossible with 4.2 (even though it worked for me once using 3.19).
Comment 6 Omar Sandoval 2015-10-31 19:45:53 UTC
Philip, the fix for this was merged in 4.3-rc1. Have you tried an rc for 4.3?
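
For anyone scripting a pre-flight check before replacing a missing RAID6 drive, the release string can be compared against 4.3. The Python sketch below only looks at the leading major.minor numbers of strings like those appearing in this report. The helper name is hypothetical, the 4.3 threshold comes from this comment, and the check cannot know about distribution kernels that backport or drop patches.

```python
import re

# First mainline series containing the fix, per this comment.
FIX_VERSION = (4, 3)


def kernel_has_replace_fix(release):
    """Compare the leading 'major.minor' of a kernel release string to 4.3."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(m.group(1)), int(m.group(2))) >= FIX_VERSION


for release in ("3.19.1-201.fc21.x86_64", "4.0.0-0.rc5", "4.2.5-1-ARCH", "4.3.0-rc7"):
    print(release, kernel_has_replace_fix(release))
```

In practice the release string would come from `platform.release()` or `uname -r` on the machine holding the array.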
Comment 7 Philip 2015-10-31 22:56:57 UTC
I've just tried this with kernel version 4.3-rc7. It works for me now:
BTRFS: dev_replace from <missing disk> (devid 4) to /dev/sde started
BTRFS: dev_replace from <missing disk> (devid 4) to /dev/sde finished

        Total devices 4 FS bytes used 3.00GiB
        devid    1 size 10.00GiB used 3.28GiB path /dev/sdb
        devid    2 size 10.00GiB used 3.28GiB path /dev/sdc
        devid    3 size 10.00GiB used 3.28GiB path /dev/sdd
        devid    4 size 10.00GiB used 3.00GiB path /dev/sde

Looks like Btrfs raid6 can handle offline drive replacements now, wonderful! Thank you very much.