Bug 218186 - divide error in blk_stack_limits() hit during fio workload against nvmf target
Summary: divide error in blk_stack_limits() hit during fio workload against nvmf target
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: All
OS: Linux
Importance: P3 normal
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-24 10:16 UTC by michallinuxstuff
Modified: 2023-11-28 12:19 UTC
CC List: 2 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description michallinuxstuff 2023-11-24 10:16:08 UTC
OS: Fedora38
Kernel: 6.5.12 (fedora build; 6.5.12-200.fc38.x86_64)

This issue happens intermittently during an fio workload (I haven't found a way to reliably reproduce it other than re-running these tests until the panic occurs).

The trace looks like the following (the kernel is tainted due to out-of-tree ICE and QAT drivers, just FYI; the devices bound to those drivers are not in use):


2023-11-24T06:52:45+01:00	Nov 24 05:52:45 10.211.11.214 [ 2274.131237] nvme nvme0: Identify Descriptors failed (nsid=3, status=0xb)
2023-11-24T06:52:45+01:00	Nov 24 05:52:45 10.211.11.214 [ 2274.179365] nvme nvme0: rescanning namespaces.
2023-11-24T06:52:46+01:00	Nov 24 05:52:45 10.211.11.214 [ 2274.354494] nvme0c0n1: I/O Cmd(0x2) @ LBA 704, 8 blocks, I/O Error (sct 0x0 / sc 0xb) DNR 
2023-11-24T06:52:46+01:00	Nov 24 05:52:45 10.211.11.214 [ 2274.364817] critical target error, dev nvme0c0n1, sector 704 op 0x0:(READ) flags 0x2000000 phys_seg 1 prio class 2
2023-11-24T06:52:46+01:00	Nov 24 05:52:45 10.211.11.214 [ 2274.409035] nvme0n1: detected capacity change from 131072 to 0
2023-11-24T06:52:46+01:00	Nov 24 05:52:45 10.211.11.214 [ 2274.416061] divide error: 0000 [#1] PREEMPT SMP PTI
2023-11-24T06:52:46+01:00	Nov 24 05:52:45 10.211.11.214 [ 2274.421939] CPU: 4 PID: 11 Comm: kworker/u321:0 Tainted: G           OE      6.5.12-100.fc37.x86_64 #1
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.432767] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.444662] Workqueue: nvme-wq nvme_scan_work [nvme_core]
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.451154] RIP: 0010:blk_stack_limits+0x19e/0x4d0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.456945] Code: b6 41 6a 08 43 6a 8b 41 30 8b 71 3c 8b 79 38 44 8b 4b 38 39 c6 0f 42 f0 48 8b 04 24 31 d2 45 31 ff 41 89 f0 01 f7 41 c1 e8 09 <49> f7 f0 44 8b 43 30 89 f8 8b 7b 3c c1 e2 09 29 d0 31 d2 f7 f6 89
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.478807] RSP: 0018:ffffb90d800c3c20 EFLAGS: 00010246
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.485097] RAX: 0000000000000000 RBX: ffff8c5e48fe39e8 RCX: ffff8c5e48fe1be8
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.493523] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.501937] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.510353] R10: 0000000000000c30 R11: 00000000c12c4cc9 R12: 00000000ffffffff
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.518768] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.527182] FS:  0000000000000000(0000) GS:ffff8c5d5eb00000(0000) knlGS:0000000000000000
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.536663] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.543524] CR2: 00007f925b190f5c CR3: 00000001fd222005 CR4: 00000000001706e0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.551939] Call Trace:
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.555110]  <TASK>
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.557886]  ? die+0x36/0x90
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.561539]  ? do_trap+0xda/0x100
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.565670]  ? blk_stack_limits+0x19e/0x4d0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.570766]  ? do_error_trap+0x6a/0x90
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.575367]  ? blk_stack_limits+0x19e/0x4d0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.580447]  ? exc_divide_error+0x38/0x50
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.585332]  ? blk_stack_limits+0x19e/0x4d0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.590405]  ? asm_exc_divide_error+0x1a/0x20
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.595668]  ? blk_stack_limits+0x19e/0x4d0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.600726]  ? __queue_work+0x1e0/0x450
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.605402]  nvme_update_ns_info_block+0x457/0x680 [nvme_core]
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.612330]  nvme_scan_ns+0x1ec/0xde0 [nvme_core]
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.618059]  nvme_scan_work+0x2a6/0x5e0 [nvme_core]
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.623984]  process_one_work+0x1e2/0x3e0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.628870]  worker_thread+0x1da/0x390
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.633413] nvme0c0n2: I/O Cmd(0x2) @ LBA 888, 8 blocks, I/O Error (sct 0x0 / sc 0xb) DNR 
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.633456]  ? __pfx_worker_thread+0x10/0x10
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.642692] critical target error, dev nvme0c0n2, sector 888 op 0x0:(READ) flags 0x2000000 phys_seg 1 prio class 2
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.647446]  kthread+0xe8/0x120
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.647452]  ? __pfx_kthread+0x10/0x10
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.668700]  ret_from_fork+0x34/0x50
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.673082]  ? __pfx_kthread+0x10/0x10
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.677645]  ret_from_fork_asm+0x1b/0x30
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.682400]  </TASK>
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.685200] Modules linked in: nvme_tcp nvme_fabrics xfs iptable_filter bridge stp llc veth qat_c62xvf(OE) vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd nbd rfkill usdm_drv(OE) sunrpc binfmt_misc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic qat_c62x(OE) ghash_clmulni_intel sha512_ssse3 intel_qat(OE) i40e rapl ib_uverbs ipmi_si iTCO_wdt intel_cstate intel_pmc_bxt mei_me ipmi_devintf iTCO_vendor_support intel_uncore ib_core i2c_i801 pcspkr mei ipmi_msghandler mgag200 joydev ioatdma dax_pmem lpc_ich uio i2c_smbus wmi ip6_tables ip_tables fuse zram bpf_preload loop overlay squashfs netconsole nd_pmem nd_btt nd_e820 libnvdimm virtio_blk virtio_net net_failover failover uas usb_storage ice(OE) gnss nvme nvme_core nvme_common mlx5_core mlxfw psample tls pci_hyperv_intf ixgbe mdio igb i2c_algo_bit dca [last unloaded: nvme_fabrics]
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.784217] ---[ end trace 0000000000000000 ]---
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.792888] RIP: 0010:blk_stack_limits+0x19e/0x4d0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.798643] Code: b6 41 6a 08 43 6a 8b 41 30 8b 71 3c 8b 79 38 44 8b 4b 38 39 c6 0f 42 f0 48 8b 04 24 31 d2 45 31 ff 41 89 f0 01 f7 41 c1 e8 09 <49> f7 f0 44 8b 43 30 89 f8 8b 7b 3c c1 e2 09 29 d0 31 d2 f7 f6 89
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.820412] RSP: 0018:ffffb90d800c3c20 EFLAGS: 00010246
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.826646] RAX: 0000000000000000 RBX: ffff8c5e48fe39e8 RCX: ffff8c5e48fe1be8
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.835011] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.843383] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.851753] R10: 0000000000000c30 R11: 00000000c12c4cc9 R12: 00000000ffffffff
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.860125] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.868492] FS:  0000000000000000(0000) GS:ffff8c5d5eb00000(0000) knlGS:0000000000000000
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.877931] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.884748] CR2: 00007f925b190f5c CR3: 00000001fd222005 CR4: 00000000001706e0
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.893207] Kernel panic - not syncing: Fatal exception
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.899569] Kernel Offset: 0x3a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2023-11-24T06:52:46+01:00	Nov 24 05:52:46 10.211.11.214 [ 2274.915130] Rebooting in 5 seconds..


It looks like Fedora applies some patches under drivers/nvme/*, but from a quick look at them they don't seem to be relevant in the context of the above failure.

Any hints regarding root cause and/or potential fixes would be appreciated. :)
Comment 1 Bagas Sanjaya 2023-11-24 14:08:39 UTC
(In reply to michallinuxstuff from comment #0)
> OS: Fedora38
> Kernel: 6.5.12 (fedora build; 6.5.12-200.fc38.x86_64)
> 
> [...]
> 
> Any hints regarding root cause and/or potential fixes would be appreciated.
> :)

You may test current mainline (v6.7-rc2).
Comment 2 michallinuxstuff 2023-11-24 16:16:44 UTC
(In reply to Bagas Sanjaya from comment #1)
> (In reply to michallinuxstuff from comment #0)
> [...]
> 
> You may test current mainline (v6.7-rc2).

Would you kindly point me in the direction of a commit available in v6.7-rc2 that may address this issue? I can test this build, but considering the intermittent nature of the issue it will be hard to tell whether the problem is really gone, especially if there are no specific changes in the code to look at. :)
Comment 3 Bagas Sanjaya 2023-11-25 02:24:42 UTC
(In reply to michallinuxstuff from comment #2)
> 
> Would you kindly point me in the direction of a commit available in v6.7-rc2
> that may address this issue? I can test this build, but considering the
> intermittent nature of the issue it will be hard to tell whether the problem
> is really gone, especially if there are no specific changes in the code to
> look at. :)

I mean, please test it to see whether your issue has already been addressed or not.
Comment 4 Keith Busch 2023-11-25 15:16:18 UTC
I'm not even sure where a division occurs in blk_stack_limits() that isn't already checked for '0'. I'll keep looking, but the problem is preceded by your nvme target appearing to drop the namespaces your host is attempting to communicate with. Was that part expected in your tests?
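For what it's worth, the faulting sequence in the Code: line above ("shr $9" immediately followed by a "div") looks consistent with the sector_div() inside queue_limit_alignment_offset(), which blk_stack_limits() uses to compute the bottom device's alignment offset: if the bottom device's limits were left zeroed (e.g. a namespace that disappeared between the rescan and the limit stacking), max(physical_block_size, io_min) is 0 and the divisor (granularity >> 9) becomes 0. The standalone C sketch below only paraphrases that arithmetic to show where the zero divisor would come from; it is speculation from the trace, not the kernel source, and "struct limits" with its fields is a simplified stand-in for struct queue_limits.

#include <stdio.h>

#define SECTOR_SHIFT 9

/* Simplified stand-in for the queue_limits fields involved here. */
struct limits {
	unsigned int physical_block_size;
	unsigned int io_min;
	unsigned int alignment_offset;
};

int main(void)
{
	/* Bottom device whose limits were left zeroed, e.g. a namespace that
	 * went away before its limits were stacked. */
	struct limits b = { 0 };
	unsigned long long start = 0;

	/* Paraphrase of the granularity computation in
	 * queue_limit_alignment_offset(): max(physical_block_size, io_min). */
	unsigned int granularity = b.physical_block_size > b.io_min ?
				   b.physical_block_size : b.io_min;
	unsigned int divisor = granularity >> SECTOR_SHIFT;

	if (divisor == 0) {
		/* In the kernel this is sector_div(start, granularity >> 9):
		 * a div instruction with a zero divisor, i.e. the #DE above. */
		printf("granularity=%u -> granularity >> 9 == 0, sector_div would fault\n",
		       granularity);
		return 1;
	}

	printf("alignment=%llu\n", (start % divisor) << SECTOR_SHIFT);
	return 0;
}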
Comment 5 michallinuxstuff 2023-11-27 08:48:48 UTC
(In reply to Keith Busch from comment #4)
> I'm not even sure where a division even occurs in blk_stack_limits() that
> isn't already checked for '0'. I'll keep looking, but the problem is
> preceded by your nvme target appearing to drop the namespaces your host is
> attempting to communicate with. Was that part expected in your tests?

Appreciate the feedback! :)

It is expected, yes. Essentially, the test works like this: the nvmf target is handled by SPDK (master branch) - several bdevs (malloc + raid) bundled into a single subsystem, exposing several namespace devices. The test then connects to said subsystem. We gather all the block nvme devices associated with the target and start executing an fio (3.35) workload against them. The second portion of the test executes fio in the background against the same set of devices, but during the workload we start removing all the bdevs (so essentially the namespaces).

All in all, the test's goal is quite simple: it just checks whether anything goes amiss during the workload combined with the sudden removal of the target devices. When that's all done, we finally disconnect from the subsystem.

Just to note, the test itself is quite old, in the sense that it has run successfully under both VM and physical environments and under different, older kernel versions. This issue happened on our staging platform, where we have been testing a newer distro with the 6.5.12 kernel on board - "production" is still using 6.1 (and has been for quite a long time now) and we haven't seen similar problems there.

I can try to bisect the kernel (6.1.14..6.5.12) and test different builds, but since I am not that versed in the kernel's internals I may miss some things (especially since the issue itself is intermittent in nature) - that's why any extra help while looking into this issue is greatly appreciated.
Comment 6 Keith Busch 2023-11-27 18:55:24 UTC
Thanks for confirming. I think you should try the most recent 6.7-rc3 release. The potentially relevant commit is included here:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=cd9aed606088d36a7ffff3e808db4e76b1854285
Comment 7 Keith Busch 2023-11-27 22:02:17 UTC
Sorry, I was mistaken with my previous suggestion. I think the actual fix is this commit staged here, but it isn't upstream yet (it should be there by 6.7-rc4):

  http://git.infradead.org/nvme.git/commitdiff/d8b90d600aff181936457f032d116dbd8534db06
Comment 8 michallinuxstuff 2023-11-28 12:19:06 UTC
(In reply to Keith Busch from comment #7)
> Sorry, I was mistaken with my previous suggestion. I think the actual fix
> is this commit staged here, but it isn't upstream yet (it should be there
> by 6.7-rc4):
> 
>   http://git.infradead.org/nvme.git/commitdiff/d8b90d600aff181936457f032d116dbd8534db06

Brilliant, thank you! Will try to test it out.
