Bug 196077 - mkfs.xfs on lv mirrior which include SSD will lead to NULL pointer
Summary: mkfs.xfs on lv mirrior which include SSD will lead to NULL pointer
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: Mike Snitzer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-15 04:19 UTC by yizhan
Modified: 2017-06-21 20:03 UTC (History)
1 user (show)

See Also:
Kernel Version: 4.12.0-rc5
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description yizhan 2017-06-15 04:19:46 UTC
mkfs.xfs on lv mirrior which include SSD will lead to NULL pointer

How reproducible:
100%


Steps to Reproduce:
1) 3 SSD: sd[n-p] 
#vgcreate helter_skelter /dev/sdl1  /dev/sd[n-p]1
#lvcreate -ay --type mirror -m 2 -n syncd_multiple_legs_3legs_1 -L 500M helter_skelter /dev/sdl1:0-2400 /dev/sdn1:0-2400 /dev/sdo1:0-2400 /dev/sdp1:0-150
#mkfs.xfs /dev/mapper/helter_skelter-syncd_multiple_legs_3legs_1

[  344.300595] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  344.309351] IP: write_callback+0x21/0x100 [dm_mirror]
[  344.314988] PGD 0 
[  344.314989] P4D 0 
[  344.317232] 
[  344.321134] Oops: 0000 [#1] SMP
[  344.324640] Modules linked in: intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_inted
[  344.388414] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.12.0-rc5 #2
[  344.395409] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.0.4 08/28/2014
[  344.403759] task: ffffa0bc79f15a00 task.stack: ffffb8a3400e0000
[  344.410367] RIP: 0010:write_callback+0x21/0x100 [dm_mirror]
[  344.416586] RSP: 0018:ffffa0cbffc03c70 EFLAGS: 00010282
[  344.422419] RAX: 0000000000000000 RBX: ffffa0cbb7d84058 RCX: 0000000000002242
[  344.430372] RDX: 0000000000002241 RSI: ffffa0cbb7d84058 RDI: ffffa0cbb7ed3680
[  344.438334] RBP: ffffa0cbffc03c98 R08: 000000000001f040 R09: ffffffff899addd7
[  344.446298] R10: ffffa0cbffc1f040 R11: ffffe7f580dfb4c0 R12: ffffa0cbb7d84058
[  344.454260] R13: ffffffffc03a4a30 R14: ffffa0cbbb350000 R15: 000000000000315c
[  344.462224] FS:  0000000000000000(0000) GS:ffffa0cbffc00000(0000) knlGS:0000000000000000
[  344.471253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  344.477663] CR2: 0000000000000000 CR3: 0000000d27a09000 CR4: 00000000001406e0
[  344.485628] Call Trace:
[  344.488358]  <IRQ>
[  344.490604]  ? recovery_complete+0x160/0x160 [dm_mirror]
[  344.496538]  dec_count+0x40/0x50 [dm_mod]
[  344.501016]  endio+0x41/0x60 [dm_mod]
[  344.505112]  bio_endio+0x83/0x100
[  344.508805]  dec_pending+0x134/0x240 [dm_mod]
[  344.513674]  ? mempool_free+0x2b/0x80
[  344.517763]  clone_endio+0x5b/0x120 [dm_mod]
[  344.522528]  bio_endio+0x83/0x100
[  344.526230]  blk_update_request+0x8c/0x330
[  344.530805]  scsi_end_request+0x34/0x1d0
[  344.535181]  scsi_io_completion+0x13b/0x620
[  344.539850]  scsi_finish_command+0xd9/0x120
[  344.544516]  scsi_softirq_done+0x12a/0x150
[  344.549093]  blk_done_softirq+0x9e/0xd0
[  344.553376]  __do_softirq+0xc9/0x269
[  344.557369]  irq_exit+0xd9/0xf0
[  344.560871]  smp_call_function_single_interrupt+0x33/0x40
[  344.566904]  call_function_single_interrupt+0x93/0xa0
[  344.572546] RIP: 0010:cpuidle_enter_state+0xe4/0x270
[  344.578083] RSP: 0018:ffffb8a3400e3e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
[  344.586531] RAX: ffffa0cbffc1bc00 RBX: ffffd8a33fc00300 RCX: 000000000000001f
[  344.594492] RDX: 0000000000000000 RSI: ffffa0cbffc19418 RDI: 0000000000000000
[  344.602453] RBP: ffffb8a3400e3ea0 R08: 0000000000000001 R09: cccccccccccccccd
[  344.610414] R10: 0000000000002bc2 R11: 0000000000000018 R12: 0000000000000004
[  344.618375] R13: 0000000000000001 R14: ffffd8a33fc00300 R15: 0000005029e9e504
[  344.626337]  </IRQ>
[  344.628677]  ? cpuidle_enter_state+0xc0/0x270
[  344.633538]  cpuidle_enter+0x17/0x20
[  344.637535]  call_cpuidle+0x23/0x40
[  344.641427]  do_idle+0x172/0x1e0
[  344.645029]  cpu_startup_entry+0x71/0x80
[  344.649414]  start_secondary+0x154/0x190
[  344.653797]  secondary_startup_64+0x9f/0x9f
[  344.658462] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 48 89 f3 48 83 ec 08 48 8b 06 48 85 ff 48 89 7d d8 <4c> 8b 20 48 c7 06 00 
[  344.679558] RIP: write_callback+0x21/0x100 [dm_mirror] RSP: ffffa0cbffc03c70
[  344.687421] CR2: 0000000000000000
[  344.691132] ---[ end trace 6a9619dc0b776113 ]---
[  344.767346] Kernel panic - not syncing: Fatal exception in interrupt
[  344.774490] Kernel Offset: 0x8800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  344.856848] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
[  344.864835] sched: Unexpected reschedule of offline CPU#0!
[  344.870963] ------------[ cut here ]------------
[  344.876119] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x3c/0x40
[  344.886309] Modules linked in: intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_inted
[  344.950057] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D         4.12.0-rc5 #2
[  344.958406] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.0.4 08/28/2014
[  344.966754] task: ffffa0bc79f15a00 task.stack: ffffb8a3400e0000
[  344.973360] RIP: 0010:native_smp_send_reschedule+0x3c/0x40
[  344.979482] RSP: 0018:ffffa0cbffc03768 EFLAGS: 00010046
[  344.985312] RAX: 000000000000002e RBX: 0000000000000000 RCX: 0000000000000000
[  344.993273] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 0000000000000300
[  345.001233] RBP: ffffa0cbffc03768 R08: 00000000fffffffe R09: 00000000000005f4
[  345.009195] R10: 0000000000000005 R11: 00000000000005f3 R12: 0000000000000001
[  345.017157] R13: 000000010000ae9e R14: ffffa0bc79f15a00 R15: ffffa0cbffc147a8
[  345.025119] FS:  0000000000000000(0000) GS:ffffa0cbffc00000(0000) knlGS:0000000000000000
[  345.034146] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  345.040555] CR2: 0000000000000000 CR3: 0000000d27a09000 CR4: 00000000001406e0
[  345.048518] Call Trace:
[  345.051243]  <IRQ>
[  345.053492]  trigger_load_balance+0x10e/0x1f0
[  345.058358]  scheduler_tick+0xa8/0xe0
[  345.062448]  ? tick_sched_do_timer+0x70/0x70
[  345.067216]  update_process_times+0x47/0x60
[  345.071884]  tick_sched_handle.isra.18+0x25/0x60
[  345.077035]  tick_sched_timer+0x3d/0x70
[  345.081314]  __hrtimer_run_queues+0xe5/0x230
[  345.086077]  hrtimer_interrupt+0xa8/0x1a0
[  345.090554]  local_apic_timer_interrupt+0x35/0x60
[  345.095801]  smp_apic_timer_interrupt+0x38/0x50
[  345.100859]  apic_timer_interrupt+0x93/0xa0
[  345.105528] RIP: 0010:panic+0x1f8/0x239
[  345.109795] RSP: 0018:ffffa0cbffc039c0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  345.118241] RAX: 0000000000000041 RBX: 0000000000000000 RCX: 0000000000000006
[  345.126201] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffffa0cbffc0e0a0
[  345.134163] RBP: ffffa0cbffc03a30 R08: 00000000fffffffe R09: 00000000000005f3
[  345.142122] R10: 0000000000000005 R11: 00000000000005f2 R12: ffffffff8a236e11
[  345.150083] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000046
[  345.158056]  oops_end+0xc6/0xd0
[  345.161566]  no_context+0x1a8/0x400
[  345.165457]  __bad_area_nosemaphore+0xee/0x1d0
[  345.170413]  bad_area_nosemaphore+0x14/0x20
[  345.175081]  __do_page_fault+0x89/0x4a0
[  345.179361]  ? __wake_up_common+0x55/0x90
[  345.183827]  do_page_fault+0x30/0x80
[  345.187815]  ? recovery_complete+0x160/0x160 [dm_mirror]
[  345.193746]  page_fault+0x28/0x30
[  345.197444] RIP: 0010:write_callback+0x21/0x100 [dm_mirror]
[  345.203661] RSP: 0018:ffffa0cbffc03c70 EFLAGS: 00010282
[  345.209489] RAX: 0000000000000000 RBX: ffffa0cbb7d84058 RCX: 0000000000002242
[  345.217450] RDX: 0000000000002241 RSI: ffffa0cbb7d84058 RDI: ffffa0cbb7ed3680
[  345.225412] RBP: ffffa0cbffc03c98 R08: 000000000001f040 R09: ffffffff899addd7
[  345.233372] R10: ffffa0cbffc1f040 R11: ffffe7f580dfb4c0 R12: ffffa0cbb7d84058
[  345.241333] R13: ffffffffc03a4a30 R14: ffffa0cbbb350000 R15: 000000000000315c
[  345.249299]  ? recovery_complete+0x160/0x160 [dm_mirror]
[  345.255227]  ? mempool_free_slab+0x17/0x20
[  345.259788]  ? recovery_complete+0x160/0x160 [dm_mirror]
[  345.265938]  dec_count+0x40/0x50 [dm_mod]
[  345.270412]  endio+0x41/0x60 [dm_mod]
[  345.274498]  bio_endio+0x83/0x100
[  345.278194]  dec_pending+0x134/0x240 [dm_mod]
[  345.283053]  ? mempool_free+0x2b/0x80
[  345.287140]  clone_endio+0x5b/0x120 [dm_mod]
[  345.291902]  bio_endio+0x83/0x100
[  345.295596]  blk_update_request+0x8c/0x330
[  345.300164]  scsi_end_request+0x34/0x1d0
[  345.304538]  scsi_io_completion+0x13b/0x620
[  345.309202]  scsi_finish_command+0xd9/0x120
[  345.313867]  scsi_softirq_done+0x12a/0x150
[  345.318435]  blk_done_softirq+0x9e/0xd0
[  345.322713]  __do_softirq+0xc9/0x269
[  345.326700]  irq_exit+0xd9/0xf0
[  345.330201]  smp_call_function_single_interrupt+0x33/0x40
[  345.336224]  call_function_single_interrupt+0x93/0xa0
[  345.341857] RIP: 0010:cpuidle_enter_state+0xe4/0x270
[  345.347394] RSP: 0018:ffffb8a3400e3e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
[  345.355838] RAX: ffffa0cbffc1bc00 RBX: ffffd8a33fc00300 RCX: 000000000000001f
[  345.363799] RDX: 0000000000000000 RSI: ffffa0cbffc19418 RDI: 0000000000000000
[  345.371758] RBP: ffffb8a3400e3ea0 R08: 0000000000000001 R09: cccccccccccccccd
[  345.379719] R10: 0000000000002bc2 R11: 0000000000000018 R12: 0000000000000004
[  345.387680] R13: 0000000000000001 R14: ffffd8a33fc00300 R15: 0000005029e9e504
[  345.395638]  </IRQ>
[  345.397976]  ? cpuidle_enter_state+0xc0/0x270
[  345.402834]  cpuidle_enter+0x17/0x20
[  345.406820]  call_cpuidle+0x23/0x40
[  345.410709]  do_idle+0x172/0x1e0
[  345.414310]  cpu_startup_entry+0x71/0x80
[  345.418685]  start_secondary+0x154/0x190
[  345.423058]  secondary_startup_64+0x9f/0x9f
[  345.427721] Code: dd 00 0f 92 c0 84 c0 74 14 48 8b 05 df 48 a9 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 f8 ca 22 8a e8 c3 41 15 00 <0f> ff 5d c3 0f 1f 44 
[  345.448790] ---[ end trace 6a9619dc0b776114 ]---


The SSD I used:

[    6.783423] scsi 0:0:12:0: Direct-Access     ATA      INTEL SSDSC2BB48 DL13 PQ: 0 ANSI: 6
[    6.807242] scsi 0:0:13:0: Direct-Access     ATA      INTEL SSDSC2BB48 DL13 PQ: 0 ANSI: 6
[    6.831367] scsi 0:0:14:0: Direct-Access     ATA      INTEL SSDSC2BB48 DL13 PQ: 0 ANSI: 6
[    6.855349] scsi 0:0:15:0: Direct-Access     ATA      INTEL SSDSC2BB48 DL13 PQ: 0 ANSI: 6
[    6.968282] sd 0:0:12:0: [sdm] 937703088 512-byte logical blocks: (480 GB/447 GiB)
[    6.968283] sd 0:0:12:0: [sdm] 4096-byte physical blocks
[    6.969092] sd 0:0:12:0: [sdm] Write Protect is off
[    6.969093] sd 0:0:12:0: [sdm] Mode Sense: 9b 00 10 08
[    6.969095] sd 0:0:13:0: [sdn] 937703088 512-byte logical blocks: (480 GB/447 GiB)
[    6.969097] sd 0:0:13:0: [sdn] 4096-byte physical blocks
[    6.969351] sd 0:0:14:0: [sdo] 937703088 512-byte logical blocks: (480 GB/447 GiB)
[    6.969352] sd 0:0:14:0: [sdo] 4096-byte physical blocks
[    6.969554] sd 0:0:15:0: [sdp] 937703088 512-byte logical blocks: (480 GB/447 GiB)
[    6.969556] sd 0:0:15:0: [sdp] 4096-byte physical blocks
[    6.969655] sd 0:0:12:0: [sdm] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    6.969800] sd 0:0:13:0: [sdn] Write Protect is off
[    6.969801] sd 0:0:13:0: [sdn] Mode Sense: 9b 00 10 08
[    6.969990] sd 0:0:14:0: [sdo] Write Protect is off
[    6.969992] sd 0:0:14:0: [sdo] Mode Sense: 9b 00 10 08
[    6.970281] sd 0:0:15:0: [sdp] Write Protect is off
[    6.970283] sd 0:0:15:0: [sdp] Mode Sense: 9b 00 10 08
[    6.970437] sd 0:0:13:0: [sdn] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    6.970608] sd 0:0:14:0: [sdo] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    6.970722] sd 0:0:15:0: [sdp] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    7.005461] sd 0:0:12:0: [sdm] Attached SCSI disk
[    7.007395] sd 0:0:13:0: [sdn] Attached SCSI disk
[    7.007534] sd 0:0:14:0: [sdo] Attached SCSI disk
[    7.007683] sd 0:0:15:0: [sdp] Attached SCSI disk
[   20.349557] sd 0:0:12:0: Attached scsi generic sg12 type 0
[   20.355741] sd 0:0:13:0: Attached scsi generic sg13 type 0
[   20.361946] sd 0:0:14:0: Attached scsi generic sg14 type 0
[   20.368183] sd 0:0:15:0: Attached scsi generic sg15 type 0
Comment 1 Mike Snitzer 2017-06-19 14:42:01 UTC
This was reported via RedHat QE on 6/14/2017 so I was aware of this.

I'll work on a fix and will report back once I it.
Comment 2 Mike Snitzer 2017-06-19 14:44:43 UTC
(In reply to Mike Snitzer from comment #1)
> This was reported via RedHat QE on 6/14/2017 so I was aware of this.
> 
> I'll work on a fix and will report back once I it.

Heh, Zhang Yi... you already reported this.  Not sure why we need a kernel.org bug too.

Please don't do this in the future!  If you'd like to raise awareness, or ping developers if you feel we've been unresponsive, post to dm-devel@redhat.com
Comment 3 yizhan 2017-06-19 14:55:47 UTC
(In reply to Mike Snitzer from comment #2)
> (In reply to Mike Snitzer from comment #1)
> > This was reported via RedHat QE on 6/14/2017 so I was aware of this.
> > 
> > I'll work on a fix and will report back once I it.
> 
> Heh, Zhang Yi... you already reported this.  Not sure why we need a
> kernel.org bug too.
> 
Hi Mike
I want to track this issue with upstream.

> Please don't do this in the future!  If you'd like to raise awareness, or
> ping developers if you feel we've been unresponsive, post to
> dm-devel@redhat.com

OK, I will post to dm-devel@ next time, thanks for reminder.

Yi
Comment 4 Mike Snitzer 2017-06-19 15:52:29 UTC
I reproduced on my testbed.  It is definitely an issue with how dm-mirror is processing discards (via dm-io).

If I stop having mkfs.xfs issue discards (using mkfs.xfs -K) then I don't see the crash.  If I let mkfs.xfs issue discards it crashes as reported.

Given it is so easy to reproduce I expect to have a fix by end of day.

Note You need to log in before you can comment on or make changes to this bug.