Created attachment 212061 [details]
Context around NULL pointer dereference

Testing an all-SSD RAID-10 device by pulling one SSD caused the Oops. Attached is the bracketing context.

Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: IP: [<ffffffff814cb5fd>] rdev_clear_badblocks+0xd/0x40
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: PGD 201c1a2067 PUD 14f9fa3067 PMD 0
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: Oops: 0000 [#1] SMP
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: Modules linked in: xt_nat(E) raid10(E) ext4(E) jbd2(E) mbcache(E) xt_conntrack(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) xt_addrtype(E) br_netfilter(E) nf_nat(E) nf_conntrack(E) bridge(E) stp(E) llc(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) ipt_REJECT(E) nf_reject_ipv4(E) iptable_filter(E) ip_tables(E) dm_service_time(E) mpt2sas(E) mptctl(E) mptbase(E) iTCO_wdt(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) coretemp(E) kvm_intel(E) kvm(E) crc32c_intel(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) glue_helper(E) lrw(E) gf128mul(E) ablk_helper(E) cryptd(E) microcode(E) pcspkr(E) sb_edac(E) lpc_ich(E) edac_core(E) bonding(E) mfd_core(E) ses(E) enclosure(E) ipmi_devintf(E) acpi_cpufreq(E)
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: shpchp(E) wmi(E) dm_multipath(E) dm_mod(E) xfs(E) libcrc32c(E) exportfs(E) sd_mod(E) ahci(E) mpt3sas(E) i40e(E) raid_class(E) libahci(E) configfs(E) scsi_transport_sas(E) megaraid_sas(E) ipv6(E) autofs4(E)
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: CPU: 13 PID: 26099 Comm: md1_raid10 Tainted: G E 4.1.8-3.clearsky.el7.centos.x86_64 #1
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: Hardware name: /0CNCJW, BIOS 1.5.4 10/002/2015
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: task: ffff880fd738e0d0 ti: ffff880072324000 task.ti: ffff880072324000
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: RIP: 0010:[<ffffffff814cb5fd>] [<ffffffff814cb5fd>] rdev_clear_badblocks+0xd/0x40
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: RSP: 0018:ffff880072327d08 EFLAGS: 00010246
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: RAX: f000000000000001 RBX: ffff880fd7b41a00 RCX: 0000000000000000
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: RBP: ffff880072327d08 R08: ffff88201ed53b38 R09: ffff88201ed53a00
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: R10: 0000000000000000 R11: ffff882033196258 R12: ffff88201e3971f8
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: R13: 0000000000000000 R14: ffff88201e3971f8 R15: 0000000000000000
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: FS: 0000000000000000(0000) GS:ffff88207f980000(0000) knlGS:0000000000000000
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: CR2: 0000000000000058 CR3: 0000001477c70000 CR4: 00000000001406e0
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: Stack:
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: ffff880072327e38 ffffffffa3c62503 ffff880fd7b41aa8 ffff88207f98d5c0
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: ffff880fd7b41a00 0000000000000286 ffff880072327d68 ffff88201e3971c0
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: ffff880072327db8 0000000000000286 ffff880072327db8 ffff880072327db8
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: Call Trace:
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffffa3c62503>] raid10d+0x4e3/0x1580 [raid10]
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff810af912>] ? del_timer_sync+0x52/0x60
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff81093ce4>] ? prepare_to_wait_event+0x84/0x100
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff814cbe16>] md_thread+0x136/0x140
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff81093ba0>] ? wait_woken+0x90/0x90
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff814cbce0>] ? find_pers+0x80/0x80
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff8106fc8b>] kthread+0xdb/0x100
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff8106fbb0>] ? kthread_create_on_node+0x170/0x170
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff815f3692>] ret_from_fork+0x42/0x70
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: [<ffffffff8106fbb0>] ? kthread_create_on_node+0x170/0x170
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: Code: c1 e0 09 4c 09 f8 48 85 c9 b9 00 00 00 00 48 0f 49 d1 48 09 d0 48 89 06 e9 d1 fe ff ff 90 0f 1f 44 00 00 55 85 c9 48 89 e5 75 13 <48> 03 77 58 48 81 c7 38 01 00 00 e8 83 fd ff ff 5d c3 90 48 03
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: RIP [<ffffffff814cb5fd>] rdev_clear_badblocks+0xd/0x40
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: RSP <ffff880072327d08>
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: CR2: 0000000000000058
Apr 7 14:00:35 38-hw-rig2-L3-2 kernel: ---[ end trace 1fdc17991b09f242 ]---
Any chance you can try the latest kernel?
Which one, 4.4 or 4.5? I can do either. This is for a product, so we are looking for a fix in a long-term support kernel.
I'd prefer 4.5. I'd like to check whether the upstream kernel has the issue; if we find the fix, backporting it to the kernel you use is easy. I tried to reproduce the issue here (e.g., directly removing a PCI controller and the disk) but didn't succeed yet, so please describe how you run the test.
We have 12 SSDs in a dual-ported SAS JBOD. We enable MPIO on all of the drives. We make 10 partitions on each drive, so we have 120 partitions. We then take the matching partitions on each drive and create an MD RAID-10 device with two mirror sides; 10 of the partitions are for data and two are for spares.

The test was: with I/O running to the JBOD, physically remove one data drive. The expected result was that one of the spare drives would be selected and made active, and 10 partitions would be re-silvered.

We will attempt to reproduce with our original configuration (4.1.8 kernel). If we can reproduce this readily, we will upgrade to 4.5 and try again. We may be able to report tomorrow.
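For concreteness, the array layout described above can be sketched as a script. This is only an illustrative dry run: the device names (/dev/sda through /dev/sdl; in the real setup these would be dm-multipath devices, since MPIO is enabled) and the exact mdadm options are assumptions, and the commands are printed rather than executed.

```shell
# Dry-run sketch of the test configuration: 12 drives x 10 partitions,
# one RAID-10 array per matching partition set (10 data + 2 spares).
# Device names are placeholders, not the actual test hardware.
gen_md_cmds() {
    for p in $(seq 1 10); do
        # collect the matching partition from all 12 drives
        parts=""
        for d in a b c d e f g h i j k l; do
            parts="$parts /dev/sd${d}${p}"
        done
        # print (not run) the create command for this array
        echo "mdadm --create /dev/md${p} --level=10 --raid-devices=10 --spare-devices=2$parts"
    done
}
gen_md_cmds
```

Running this prints the ten mdadm invocations so the intended topology can be reviewed before touching real hardware.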
Great. There are 4 places calling rdev_clear_badblocks(); could you add BUG_ON(!rdev) at each of them and report which one triggers the bug? Maybe it is spare-related. I'll try here too.
We were finally able to put all of the ingredients into the same pot. Using Linux 4.6 with raid10.c patched as requested, we were able to reproduce this; it happens every time we try it. I have attached the extract from /var/log/messages related to the crash.
Created attachment 217951 [details]
messages extract after reproducing in 4.6

This is the kernel spew that came out when we reproduced the crash.
Created attachment 218501 [details]
fix

Can you check whether this fixes it?
Created attachment 220031 [details]
dmesg output from both controllers

We are trying to reproduce this issue with our original kernel (without your suggested change). We haven't succeeded, but we did get a kernel warning on each controller during our last attempt. I will attach the dmesg output from each controller.
Created attachment 220041 [details]
second dmesg output
Matthew Martin and I hadn't been able to reproduce this issue until recently. We're now using a 4.9.89 kernel, and I add a BUG_ON() in raid10.c here:

@@ -2655,6 +2656,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 				continue;
 			if (!r10_bio->devs[m].repl_bio->bi_error) {
+				BUG_ON(!rdev);
 				rdev_clear_badblocks(
 					rdev,
 					r10_bio->devs[m].addr,

When I test with this change (using "echo 1 > /sys/block/XXX/device/delete" while doing I/O), the kernel asserts at that BUG_ON().
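The reproduction step above can be scripted. This sketch prints the commands rather than running them, since deleting a member disk is destructive; the device names (sdx, md1) and the dd I/O load are placeholders, not the actual test setup.

```shell
# Sketch of the reproduction loop: run I/O against the md array while
# hot-removing one member disk out from under md via sysfs.
# DEV and MD are placeholders -- substitute a real member disk and array.
DEV=${DEV:-sdx}
MD=${MD:-md1}

repro_cmds() {
    # background I/O against the array (dd stands in for the real workload)
    echo "dd if=/dev/zero of=/dev/$MD bs=1M oflag=direct &"
    # hot-remove the member disk, as in the comment above
    echo "echo 1 > /sys/block/$DEV/device/delete"
    # watch the array degrade and (in the expected case) the spare activate
    echo "cat /proc/mdstat"
}
repro_cmds
```

With the BUG_ON() patch applied, the assertion fires shortly after the sysfs delete while writes are in flight.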
Created attachment 280357 [details]
add rcu protection

This is based on the 5.0 kernel, so please make the related changes for your version.
(In reply to gary from comment #12)
> Created attachment 280357 [details]
> add rcu protection
>
> This is based on 5.0 kernel, so pls make related changes to your version

Could you try this change? Though I am not sure whether it can fix your problem or not.
Created attachment 280359 [details]
add WQ_MEM_RECLAIM flag for firmware_event_thread
(In reply to Matthew L. Martin from comment #10)
> Created attachment 220041 [details]
> second dmesg output

Maybe the complaint about "WQ_MEM_RECLAIM fw_event_mpt3sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu" in the dmesgs can be suppressed by attachment 280359 [details].
(In reply to Guoqing from comment #15)
> (In reply to Matthew L. Martin from comment #10)
> > Created attachment 220041 [details]
> > second dmesg output
>
> Maybe the complain about "WQ_MEM_RECLAIM
> fw_event_mpt3sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM
> events:lru_add_drain_per_cpu" in dmesgs can be suppressed by attachment
> 280359 [details].

Hmm, that was wrong; commit f3a932baa7f65072434f1c04c02c8a4d2746fcfc ("mm: introduce dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_all") is the right fix, from my understanding.
I haven't had time to look at this in a while; I'll try the change in comment 12 when I get a chance, but I don't know when that will be.

FWIW, it seems like maybe this change:

commit 3f5af7cc105f9bf5e6995db27e20c0f83ad5abcb
Author: Yufen Yu <yuyufen@huawei.com>
Date:   Tue Feb 6 17:39:15 2018 +0800

    md raid10: fix NULL deference in handle_write_completed()

    [ Upstream commit 01a69cab01c184d3786af09e9339311123d63d22 ]

    In the case of 'recover', an r10bio with R10BIO_WriteError &
    R10BIO_IsRecover will be progressed by handle_write_completed().
    This function traverses all r10bio->devs[copies]. If devs[m].repl_bio
    != NULL, it thinks conf->mirrors[dev].replacement is also not NULL.
    However, this is not always true.
    ...

prevents the panic we've been seeing. I have to double-check, though, so please take it with a grain of salt.