Bug 196749 - kernel BUG at block/blk-core.c:2054
Summary: kernel BUG at block/blk-core.c:2054
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: io_md
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-23 23:38 UTC by Jonathan G. Underwood
Modified: 2017-10-03 16:25 UTC (History)
1 user (show)

See Also:
Kernel Version: Linux version 4.12.5-300.fc26.x86_64 (mockbuild@bkernel01.phx2.fedoraproject.org) (gcc version 7.1.1 20170622 (Red Hat 7.1.1-3) (GCC) ) #1 SMP Mon Aug 7 15:27:25 UTC 2017
Subsystem:
Regression: No
Bisected commit-id:


Attachments
debug patch (433 bytes, patch)
2017-08-24 22:57 UTC, Shaohua Li
Details | Diff
debug patch (767 bytes, patch)
2017-08-24 23:03 UTC, Shaohua Li
Details | Diff

Description Jonathan G. Underwood 2017-08-23 23:38:50 UTC
I keep hitting the bug detailed in the backtrace below. I am running Fedora 26 on an Hp Microserver Gen 8. I have a 2 disk RAID 1 array set up with mdadm comprising two 2TB disks. I set up the raid array as follows:

mdadm -R --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

I am hitting the bug when attempting to write random data to the md device by doing the following:

cryptsetup open --type plain /dev/md0 container --key-file /dev/random
dd bs=1M if=/dev/zero of=/dev/mapper/container status=progress

At some point during the dd invocation, after a few GB has been written, I see the following bug:



[  464.808442] ------------[ cut here ]------------
[  464.808445] kernel BUG at block/blk-core.c:2054!
[  464.808506] invalid opcode: 0000 [#1] SMP
[  464.808561] Modules linked in: dm_crypt ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support gpio_ich ipmi_ssif intel_uncore intel_rapl_perf raid1 ipmi_si ipmi_devintf hpwdt hpilo ipmi_msghandler acpi_power_meter pcc_cpufreq lpc_ich tpm_tis tpm_tis_core shpchp ie31200_edac tpm i2c_algo_bit
[  464.808724]  drm_kms_helper ttm crc32c_intel drm uas serio_raw usb_storage tg3 ptp pps_core
[  464.808790] CPU: 0 PID: 493 Comm: md0_resync Not tainted 4.12.5-300.fc26.x86_64 #1
[  464.808848] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/02/2015
[  464.808906] task: ffff8a68eb41c980 task.stack: ffffaf2881e8c000
[  464.808968] RIP: 0010:generic_make_request+0x2c4/0x2d0
[  464.809025] RSP: 0000:ffffaf2881e8fbe0 EFLAGS: 00010286
[  464.809241] RAX: ffff8a68eb41c980 RBX: ffff8a6836195a00 RCX: 0000000000000000
[  464.809460] RDX: 0000000000000402 RSI: 0000000000000001 RDI: ffff8a6836abedb8
[  464.809681] RBP: ffffaf2881e8fc30 R08: 0000000000000010 R09: 0000000000001000
[  464.809900] R10: ffffaf2881e8fc48 R11: 00000000ffffffff R12: 0000000000000008
[  464.810120] R13: ffff8a683601f600 R14: 0000000000000080 R15: ffff8a68eb8d8000
[  464.810341] FS:  0000000000000000(0000) GS:ffff8a690a600000(0000) knlGS:0000000000000000
[  464.810721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  464.810938] CR2: 00007fb59a851000 CR3: 000000005d01c000 CR4: 00000000001406f0
[  464.811161] Call Trace:
[  464.811381]  ? bitmap_start_sync+0x56/0x100
[  464.811601]  raid1_sync_request+0xadf/0xb90 [raid1]
[  464.811820]  ? raid1_sync_request+0xadf/0xb90 [raid1]
[  464.812039]  ? is_mddev_idle+0xa6/0x10a
[  464.812256]  md_do_sync+0x8cd/0xee0
[  464.812475]  ? finish_wait+0x80/0x80
[  464.812693]  md_thread+0x125/0x170
[  464.812910]  ? md_thread+0x125/0x170
[  464.813128]  kthread+0x125/0x140
[  464.813346]  ? find_pers+0x70/0x70
[  464.813560]  ? kthread_park+0x60/0x60
[  464.813778]  ? do_syscall_64+0x67/0x140
[  464.814001]  ret_from_fork+0x25/0x30
[  464.814219] Code: bc 24 68 07 00 00 f0 49 83 ac 24 68 07 00 00 01 0f 85 85 fe ff ff 41 ff 94 24 78 07 00 00 e9 78 fe ff ff 4c 89 28 e9 ae fd ff ff <0f> 0b e8 95 6e cc ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 
[  464.814812] RIP: generic_make_request+0x2c4/0x2d0 RSP: ffffaf2881e8fbe0
[  464.815058] ---[ end trace 2b032b91acd7b852 ]---
[  464.820281] ------------[ cut here ]------------
[  464.820511] WARNING: CPU: 0 PID: 493 at kernel/exit.c:785 do_exit+0x51/0xb50
[  464.820734] Modules linked in: dm_crypt ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support gpio_ich ipmi_ssif intel_uncore intel_rapl_perf raid1 ipmi_si ipmi_devintf hpwdt hpilo ipmi_msghandler acpi_power_meter pcc_cpufreq lpc_ich tpm_tis tpm_tis_core shpchp ie31200_edac tpm i2c_algo_bit
[  464.822702]  drm_kms_helper ttm crc32c_intel drm uas serio_raw usb_storage tg3 ptp pps_core
[  464.823095] CPU: 0 PID: 493 Comm: md0_resync Tainted: G      D         4.12.5-300.fc26.x86_64 #1
[  464.823487] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/02/2015
[  464.823710] task: ffff8a68eb41c980 task.stack: ffffaf2881e8c000
[  464.823933] RIP: 0010:do_exit+0x51/0xb50
[  464.824151] RSP: 0000:ffffaf2881e8fed8 EFLAGS: 00010202
[  464.824373] RAX: ffffaf2881e8fd80 RBX: ffff8a68eb41c980 RCX: 0000000000000000
[  464.824598] RDX: ffff8a68df4a2400 RSI: 0000000000000000 RDI: ffffffff94ef8d60
[  464.824821] RBP: ffffaf2881e8ff48 R08: 00000000000e9000 R09: 00000000000e8025
[  464.825043] R10: 0000000000000008 R11: 0000000000000000 R12: 000000000000000b
[  464.825266] R13: ffffaf2881e8fb38 R14: 0000000000000000 R15: ffffaf2881e8fb38
[  464.825493] FS:  0000000000000000(0000) GS:ffff8a690a600000(0000) knlGS:0000000000000000
[  464.825880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  464.826100] CR2: 00007fb59a851000 CR3: 000000005d01c000 CR4: 00000000001406f0
[  464.826322] Call Trace:
[  464.826546]  ? kthread+0x125/0x140
[  464.826767]  rewind_stack_do_exit+0x17/0x20
[  464.826988] Code: 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 e8 a7 b8 06 00 48 8b 83 60 0b 00 00 48 85 c0 74 0e 48 8b 10 48 39 d0 0f 84 60 04 00 00 <0f> ff 65 8b 05 86 4f f6 6b 25 00 ff 1f 00 89 45 9c 0f 85 28 08 
[  464.827576] ---[ end trace 2b032b91acd7b853 ]---
Comment 1 Jens Axboe 2017-08-24 14:44:35 UTC
Fixed by:

commit 0c9d5b127f695818c2c5a3868c1f28ca2969e905
Author: NeilBrown <neilb@suse.com>
Date:   Thu Apr 6 12:06:37 2017 +1000

    md/raid1: avoid reusing a resync bio after error handling.
Comment 2 Jonathan G. Underwood 2017-08-24 16:33:44 UTC
(In reply to Jens Axboe from comment #1)
> Fixed by:
> 
> commit 0c9d5b127f695818c2c5a3868c1f28ca2969e905
> Author: NeilBrown <neilb@suse.com>
> Date:   Thu Apr 6 12:06:37 2017 +1000
> 
>     md/raid1: avoid reusing a resync bio after error handling.

Unfortunately that patch is already present in 4.12, and so is already in the Fedora kernel.
Comment 3 Jens Axboe 2017-08-24 16:40:05 UTC
Sounds like raid1 has another issue of submitting a bio that isn't initialized properly.
Comment 4 Shaohua Li 2017-08-24 16:52:20 UTC
can you check if below patch fixes it. It's in latest 4.12 stable

commit 022e510fcbda79183fd2cdc01abb01b4be80d03f
Author: Ming Lei <ming.lei@redhat.com>
Date:   Fri Jul 14 16:14:42 2017 +0800

    md: remove 'idx' from 'struct resync_pages'
Comment 5 Jonathan G. Underwood 2017-08-24 22:23:04 UTC
Same deal with 4.12.8 which nicludes that patch.

/proc/version:
Linux version 4.12.8-300.fc26.x86_64 (mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 7.1.1 20170622 (Red Hat 7.1.1-3) (GCC) ) #1 SMP Thu Aug 17 15:30:20 UTC 2017


[  261.081131] kernel BUG at block/blk-core.c:2054!
[  261.081194] invalid opcode: 0000 [#1] SMP
[  261.081249] Modules linked in: dm_crypt ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl sunrpc x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul gpio_ich crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ipmi_ssif raid1 lpc_ich hpilo hpwdt ipmi_si ipmi_devintf pcc_cpufreq ipmi_msghandler acpi_power_meter tpm_tis tpm_tis_core shpchp tpm ie31200_edac i2c_algo_bit
[  261.081412]  drm_kms_helper crc32c_intel tg3 ttm serio_raw uas usb_storage drm ptp pps_core
[  261.081478] CPU: 0 PID: 471 Comm: md0_resync Not tainted 4.12.8-300.fc26.x86_64 #1
[  261.081538] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/02/2015
[  261.081598] task: ffff90a1ec0e0000 task.stack: ffffb26b01eac000
[  261.081660] RIP: 0010:generic_make_request+0x2c4/0x2d0
[  261.081717] RSP: 0018:ffffb26b01eafbe0 EFLAGS: 00010286
[  261.081936] RAX: ffff90a1ec0e0000 RBX: ffff90a2000c0b00 RCX: 0000000000000000
[  261.082157] RDX: 0000000000000402 RSI: 0000000000000001 RDI: ffff90a2019ac5b8
[  261.082378] RBP: ffffb26b01eafc30 R08: 0000000000000010 R09: 0000000000001000
[  261.082599] R10: ffffb26b01eafc48 R11: 00000000ffffffff R12: 0000000000000008
[  261.082820] R13: ffff90a131672000 R14: 0000000000000080 R15: ffff90a1a9c95c80
[  261.083043] FS:  0000000000000000(0000) GS:ffff90a206200000(0000) knlGS:0000000000000000
[  261.083427] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.083649] CR2: 0000000d3c59e0d8 CR3: 00000000e221c000 CR4: 00000000001406f0
[  261.083872] Call Trace:
[  261.084095]  ? bitmap_start_sync+0x56/0x100
[  261.084317]  raid1_sync_request+0xadf/0xb90 [raid1]
[  261.084540]  ? raid1_sync_request+0xadf/0xb90 [raid1]
[  261.084764]  ? is_mddev_idle+0xa6/0x10a
[  261.084986]  md_do_sync+0x8cd/0xee0
[  261.085210]  ? finish_wait+0x80/0x80
[  261.085435]  md_thread+0x125/0x170
[  261.085667]  ? md_thread+0x125/0x170
[  261.085899]  kthread+0x125/0x140
[  261.086128]  ? find_pers+0x70/0x70
[  261.086349]  ? kthread_park+0x60/0x60
[  261.086573]  ret_from_fork+0x25/0x30
[  261.086794] Code: bc 24 68 07 00 00 f0 49 83 ac 24 68 07 00 00 01 0f 85 85 fe ff ff 41 ff 94 24 78 07 00 00 e9 78 fe ff ff 4c 89 28 e9 ae fd ff ff <0f> 0b e8 35 6c cc ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 
[  261.087389] RIP: generic_make_request+0x2c4/0x2d0 RSP: ffffb26b01eafbe0
[  261.087631] ---[ end trace f20eba28dd100ea0 ]---
[  261.094734] ------------[ cut here ]------------
[  261.094962] WARNING: CPU: 0 PID: 471 at kernel/exit.c:785 do_exit+0x51/0xb50
[  261.095186] Modules linked in: dm_crypt ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl sunrpc x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul gpio_ich crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ipmi_ssif raid1 lpc_ich hpilo hpwdt ipmi_si ipmi_devintf pcc_cpufreq ipmi_msghandler acpi_power_meter tpm_tis tpm_tis_core shpchp tpm ie31200_edac i2c_algo_bit
[  261.097158]  drm_kms_helper crc32c_intel tg3 ttm serio_raw uas usb_storage drm ptp pps_core
[  261.097564] CPU: 0 PID: 471 Comm: md0_resync Tainted: G      D         4.12.8-300.fc26.x86_64 #1
[  261.097970] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/02/2015
[  261.098201] task: ffff90a1ec0e0000 task.stack: ffffb26b01eac000
[  261.098436] RIP: 0010:do_exit+0x51/0xb50
[  261.098662] RSP: 0018:ffffb26b01eafed8 EFLAGS: 00010202
[  261.098898] RAX: ffffb26b01eafd80 RBX: ffff90a1ec0e0000 RCX: 0000000000000000
[  261.099127] RDX: ffff90a1b5ec2400 RSI: 0000000000000000 RDI: ffffffffb3ef8d20
[  261.099355] RBP: ffffb26b01eaff48 R08: 00000000000e9000 R09: 00000000000e8025
[  261.099584] R10: 0000000000000008 R11: 0000000000000000 R12: 000000000000000b
[  261.099820] R13: ffffb26b01eafb38 R14: 0000000000000000 R15: ffffb26b01eafb38
[  261.100051] FS:  0000000000000000(0000) GS:ffff90a206200000(0000) knlGS:0000000000000000
[  261.100450] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.100675] CR2: 0000000d3c59e0d8 CR3: 00000000e221c000 CR4: 00000000001406f0
[  261.100915] Call Trace:
[  261.101150]  ? kthread+0x125/0x140
[  261.101377]  rewind_stack_do_exit+0x17/0x20
[  261.101608] Code: 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 e8 d7 b8 06 00 48 8b 83 60 0b 00 00 48 85 c0 74 0e 48 8b 10 48 39 d0 0f 84 60 04 00 00 <0f> ff 65 8b 05 86 4f f6 4c 25 00 ff 1f 00 89 45 9c 0f 85 28 08 
[  261.102229] ---[ end trace f20eba28dd100ea1 ]---
Comment 6 Shaohua Li 2017-08-24 22:57:43 UTC
Created attachment 258087 [details]
debug patch

Can you try this patch?
Comment 7 Shaohua Li 2017-08-24 23:03:03 UTC
Created attachment 258089 [details]
debug patch

alright, we need set the bio private
Comment 8 Jonathan G. Underwood 2017-08-24 23:18:30 UTC
Thanks for the patch. Unfortunately I have to go to bed (it's already gone midnight), and then will be travelling for a few days, but I'll try the patch when I get back.
Comment 9 Jonathan G. Underwood 2017-08-24 23:33:27 UTC
Is that patch against 4.12.8, or latest upstream?
Comment 10 Shaohua Li 2017-08-24 23:42:02 UTC
For upstream. Looks I still didn't set the correctly bio private. I'll cook a new patch.
Comment 11 Shaohua Li 2017-08-25 02:43:54 UTC
I'll merge this patch to upstream. I think it should fix your issue:
https://marc.info/?l=linux-raid&m=150362889201103&w=2
Comment 12 Jonathan G. Underwood 2017-08-29 19:50:29 UTC
OK, I applied that against the Fedora 4.12.9 kernel and have ran the dd command without the BUG appearing - so far I have written 50GB of data. Previously this would have triggered the BUG after a few GB reproducibly, so it looks like this has fixed things. 

I'll leave it running overnight to truly confirm.

Thanks for your help!
Comment 13 Jonathan G. Underwood 2017-08-29 20:10:55 UTC
ps. I notice this isn't scheduled for the 4.2.10 stable release - is that an oversight, or was the patch just too late?
Comment 14 Shaohua Li 2017-08-30 04:45:17 UTC
I run a little late to push the patches to upstream. Will push this to 4.14 and this will be picked up by stable.

Note You need to log in before you can comment on or make changes to this bug.