Tested on 5.14 and 5.16.0-rc2, Debian testing, amd64, gcc 11.2.0.

Reproduction:

modprobe nvmet
modprobe nvmet-tcp
truncate --size 1G /tmp/foo1.img
DEV=$(losetup --find /tmp/foo1.img)

mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem
cd /sys/kernel/config/nvmet/subsystems/testsubsystem
echo 1 > attr_allow_any_host
echo "${DEV}" > device_path
echo 1 > enable

mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 4420 > addr_trsvcid
echo 127.0.0.1 > addr_traddr
echo tcp > addr_trtype
echo ipv4 > addr_adrfam
ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem

modprobe nvme-tcp
nvme connect -t tcp -n testsubsystem -a 127.0.0.1 -s 4420
nvme list
nvme delete-ns -n 1 /dev/nvmeX

The delete-ns command eventually finishes with "Interrupted", but the device is unusable afterwards.

nvme-tcp reports problems in dmesg:

[10638.139221] nvmet: creating nvm controller 1 for subsystem testsubsystem for NQN nqn.2014-08.org.nvmexpress:uuid:88a968b7-a5d7-4549-aaa8-a31f1d25aa4b.
[10638.139677] nvme nvme2: creating 32 I/O queues.
[10638.141556] nvme nvme2: mapped 32/0/0 default/read/poll queues.
[10638.145884] nvme nvme2: new ctrl: NQN "testsubsystem", addr 127.0.0.1:4420
[10678.676734] nvmet_tcp: failed cmd 00000000fdc07f80 id 4100 opcode 13, data_len: 0
[10684.129665] nvme nvme2: queue 0: timeout request 0x0 type 4
[10684.129673] nvme nvme2: starting error recovery
[10684.131074] nvme nvme2: failed nvme_keep_alive_end_io error=10
[10684.137695] nvme nvme2: Reconnecting in 10 seconds...
[10691.809468] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[10691.809519] nvmet: ctrl 1 fatal error occurred!
[10756.832004] nvme nvme2: queue 0: timeout request 0x0 type 4
[10756.832137] nvme nvme2: Connect command failed, error wo/DNR bit: 881
[10756.832199] nvme nvme2: failed to connect queue: 0 ret=881
[10756.832222] nvme nvme2: Failed reconnect attempt 1
[10756.832225] nvme nvme2: Reconnecting in 10 seconds...
[10814.174824] INFO: task kworker/11:2:11273 blocked for more than 122 seconds.
[10814.174832] Tainted: G E 5.16.0-rc2 #1
[10814.174834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10814.174835] task:kworker/11:2 state:D stack: 0 pid:11273 ppid: 2 flags:0x00004000
[10814.174841] Workqueue: events nvmet_tcp_release_queue_work [nvmet_tcp]
[10814.174850] Call Trace:
[10814.174852] <TASK>
[10814.174855] __schedule+0x23f/0xe50
[10814.174861] ? update_load_avg+0x77/0x4d0
[10814.174866] schedule+0x50/0xd0
[10814.174869] schedule_timeout+0x99/0xd0
[10814.174873] ? raw_spin_rq_lock_nested+0x19/0x80
[10814.174876] wait_for_completion+0x74/0xc0
[10814.174880] __flush_work.isra.0+0x14b/0x210
[10814.174884] ? flush_workqueue_prep_pwqs+0x110/0x110
[10814.174887] nvmet_tcp_release_queue_work+0xc3/0x330 [nvmet_tcp]
[10814.174892] process_one_work+0x1ad/0x310
[10814.174895] worker_thread+0x48/0x3d0
[10814.174899] ? process_one_work+0x310/0x310
[10814.174901] kthread+0x155/0x180
[10814.174904] ? set_kthread_struct+0x30/0x30
[10814.174907] ret_from_fork+0x1f/0x30
[10814.174912] </TASK>
[10830.558291] nvme nvme2: queue 0: timeout request 0x0 type 4
[10830.558422] nvme nvme2: Connect command failed, error wo/DNR bit: 881
[10830.558484] nvme nvme2: failed to connect queue: 0 ret=881
[10830.558507] nvme nvme2: Failed reconnect attempt 2
[10830.558510] nvme nvme2: Reconnecting in 10 seconds...
[10904.284728] nvme nvme2: queue 0: timeout request 0x0 type 4
[10904.284757] nvme nvme2: Connect command failed, error wo/DNR bit: 881
[10904.284818] nvme nvme2: failed to connect queue: 0 ret=881
[10904.284842] nvme nvme2: Failed reconnect attempt 3
[10904.284845] nvme nvme2: Reconnecting in 10 seconds...
[10937.051999] INFO: task kworker/11:2:11273 blocked for more than 245 seconds.
[10937.052010] Tainted: G E 5.16.0-rc2 #1
[10937.052013] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10937.052016] task:kworker/11:2 state:D stack: 0 pid:11273 ppid: 2 flags:0x00004000
[10937.052024] Workqueue: events nvmet_tcp_release_queue_work [nvmet_tcp]
[10937.052036] Call Trace:
[10937.052038] <TASK>
[10937.052042] __schedule+0x23f/0xe50
[10937.052051] ? update_load_avg+0x77/0x4d0
[10937.052058] schedule+0x50/0xd0
[10937.052062] schedule_timeout+0x99/0xd0
[10937.052066] ? raw_spin_rq_lock_nested+0x19/0x80
[10937.052070] wait_for_completion+0x74/0xc0
[10937.052074] __flush_work.isra.0+0x14b/0x210
[10937.052078] ? flush_workqueue_prep_pwqs+0x110/0x110
[10937.052082] nvmet_tcp_release_queue_work+0xc3/0x330 [nvmet_tcp]
[10937.052086] process_one_work+0x1ad/0x310
[10937.052090] worker_thread+0x48/0x3d0
[10937.052093] ? process_one_work+0x310/0x310
[10937.052096] kthread+0x155/0x180
[10937.052100] ? set_kthread_struct+0x30/0x30
[10937.052103] ret_from_fork+0x1f/0x30
[10937.052108] </TASK>
[10978.010919] nvme nvme2: queue 0: timeout request 0x0 type 4
[10978.010974] nvme nvme2: Connect command failed, error wo/DNR bit: 881
[10978.011031] nvme nvme2: failed to connect queue: 0 ret=881
[10978.011052] nvme nvme2: Failed reconnect attempt 4
[10978.011055] nvme nvme2: Reconnecting in 10 seconds...
[11051.737218] nvme nvme2: queue 0: timeout request 0x0 type 4
[11051.737247] nvme nvme2: Connect command failed, error wo/DNR bit: 881
[11051.737307] nvme nvme2: failed to connect queue: 0 ret=881
[11051.737330] nvme nvme2: Failed reconnect attempt 5
[11051.737333] nvme nvme2: Reconnecting in 10 seconds...
[11059.929161] INFO: task kworker/11:2:11273 blocked for more than 368 seconds.
[11059.929168] Tainted: G E 5.16.0-rc2 #1
[11059.929171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11059.929172] task:kworker/11:2 state:D stack: 0 pid:11273 ppid: 2 flags:0x00004000
[11059.929178] Workqueue: events nvmet_tcp_release_queue_work [nvmet_tcp]
[11059.929188] Call Trace:
[11059.929189] <TASK>
[11059.929192] __schedule+0x23f/0xe50
[11059.929198] ? update_load_avg+0x77/0x4d0
[11059.929204] schedule+0x50/0xd0
[11059.929207] schedule_timeout+0x99/0xd0
[11059.929210] ? raw_spin_rq_lock_nested+0x19/0x80
[11059.929214] wait_for_completion+0x74/0xc0
[11059.929217] __flush_work.isra.0+0x14b/0x210
[11059.929221] ? flush_workqueue_prep_pwqs+0x110/0x110
[11059.929225] nvmet_tcp_release_queue_work+0xc3/0x330 [nvmet_tcp]
[11059.929229] process_one_work+0x1ad/0x310
[11059.929232] worker_thread+0x48/0x3d0
[11059.929236] ? process_one_work+0x310/0x310
[11059.929238] kthread+0x155/0x180
[11059.929241] ? set_kthread_struct+0x30/0x30
[11059.929244] ret_from_fork+0x1f/0x30
[11059.929249] </TASK>
At the start, `DEV=$(losetup --find /tmp/foo1.img)` should have been `DEV=$(losetup --show --find /tmp/foo1.img)`.

The kernel messages about blocked tasks continue forever. This also causes problems for other I/O on the system, which gets sporadically blocked too.
That's strange: the delete namespace operation isn't supported by the nvme target, so it should just return Invalid Opcode immediately. I'll retry your steps and see what happens.
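For context on why an immediate failure is expected: the target dispatches admin commands by opcode and completes anything it does not recognize with an Invalid Opcode status that carries the "do not retry" (DNR) bit. The snippet below is only a simplified userspace illustration of that dispatch pattern, not the nvmet code; the opcode values (Namespace Management is 0x0d, used by nvme delete-ns/create-ns) and the status bits follow the NVMe specification, while the set of "supported" opcodes shown is deliberately incomplete and the function name is made up for the example.

--
/* Simplified model of opcode-based admin dispatch: unknown opcodes are
 * completed immediately with "Invalid Opcode" + DNR (do not retry).
 * Illustration only, not the nvmet implementation. */
#include <stdio.h>
#include <stdint.h>

#define SC_INVALID_OPCODE 0x01
#define SC_DNR            (1 << 14)   /* Do Not Retry bit in the NVMe status field */

#define OPC_GET_LOG_PAGE  0x02
#define OPC_IDENTIFY      0x06
#define OPC_SET_FEATURES  0x09
#define OPC_NS_MGMT       0x0d        /* sent by nvme delete-ns / create-ns */
#define OPC_KEEP_ALIVE    0x18

static uint16_t parse_admin_cmd(uint8_t opcode)
{
	switch (opcode) {
	case OPC_GET_LOG_PAGE:
	case OPC_IDENTIFY:
	case OPC_SET_FEATURES:
	case OPC_KEEP_ALIVE:
		return 0;	/* recognized: would be queued for execution */
	default:
		/* unrecognized: fail right away, host must not retry */
		return SC_INVALID_OPCODE | SC_DNR;
	}
}

int main(void)
{
	printf("opcode 0x0d -> status 0x%04x\n",
	       (unsigned)parse_admin_cmd(OPC_NS_MGMT));
	return 0;
}
--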
Is the delete-ns command actually necessary to trigger this condition?
[adding Sagi & linux-nvme mailing list]

On Sat, Nov 27, 2021 at 11:24:57PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215157

<snip>

> Reproduction:
>
> modprobe nvmet
> modprobe nvmet-tcp
> truncate --size 1G /tmp/foo1.img
> DEV=$(losetup --find /tmp/foo1.img)
>
> mkdir /sys/kernel/config/nvmet/subsystems/testsubsystem
> cd /sys/kernel/config/nvmet/subsystems/testsubsystem
> echo 1 > attr_allow_any_host
> echo "${DEV}" > device_path
> echo 1 > enable
>
> mkdir /sys/kernel/config/nvmet/ports/1
> cd /sys/kernel/config/nvmet/ports/1
> echo 4420 > addr_trsvcid
> echo 127.0.0.1 > addr_traddr
> echo tcp > addr_trtype
> echo ipv4 > addr_adrfam
> ln -s /sys/kernel/config/nvmet/subsystems/testsubsystem/ /sys/kernel/config/nvmet/ports/1/subsystems/testsubsystem
>
> modprobe nvme-tcp
> nvme connect -t tcp -n testsubsystem -a 127.0.0.1 -s 4420
> nvme list
> nvme delete-ns -n 1 /dev/nvmeX
>
> The delete-ns command eventually finishes with "Interrupted", but the device is unusable afterwards.
>
> nvme-tcp reports problems in dmesg:
>
> [10638.139221] nvmet: creating nvm controller 1 for subsystem testsubsystem for NQN nqn.2014-08.org.nvmexpress:uuid:88a968b7-a5d7-4549-aaa8-a31f1d25aa4b.
> [10638.139677] nvme nvme2: creating 32 I/O queues.
> [10638.141556] nvme nvme2: mapped 32/0/0 default/read/poll queues.
> [10638.145884] nvme nvme2: new ctrl: NQN "testsubsystem", addr 127.0.0.1:4420
> [10678.676734] nvmet_tcp: failed cmd 00000000fdc07f80 id 4100 opcode 13, data_len: 0
> [10684.129665] nvme nvme2: queue 0: timeout request 0x0 type 4

The target sees the opcode 13 (0xd) for the delete namespace management
command and correctly decides it is unsupported, so the request
initialization fails. Here's where it looks like the target code goes
wrong:

nvmet_tcp_try_recv_one
 nvmet_tcp_done_recv_pdu
  nvmet_req_init
   __nvmet_req_complete
    nvmet_tcp_queue_response <- cmd is added to the queue resp_list
  nvmet_tcp_handle_req_failure() -> rcv_state set to NVMET_TCP_RECV_DATA,
                                    flags set to NVMET_TCP_F_INIT_FAILED
 nvmet_tcp_try_recv_data
  nvmet_tcp_execute_request <- sees NVMET_TCP_F_INIT_FAILED
   nvmet_tcp_queue_response <- cmd is added to the queue resp_list a 2nd time

That can't be right: the cmd shouldn't be added to the response list
twice.

Sagi, am I missing something here?

> [10684.129673] nvme nvme2: starting error recovery
> [10684.131074] nvme nvme2: failed nvme_keep_alive_end_io error=10
> [10684.137695] nvme nvme2: Reconnecting in 10 seconds...
> [10691.809468] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [10691.809519] nvmet: ctrl 1 fatal error occurred!

<snip>
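To see why queueing the same command for response twice would be harmful, here is a small standalone C program that mimics an intrusive, singly linked response list. The real resp_list in nvmet_tcp is a lock-less llist and the replies below discuss when the double add can actually occur; this toy, single-threaded version is only meant to illustrate the hazard described above, not to reproduce the kernel data structure.

--
/* Toy model: an intrusive singly linked "response list". Adding the same
 * entry twice corrupts it (here: a self-referencing node), which is why a
 * command must never be queued for response more than once. */
#include <stdio.h>
#include <stddef.h>

struct cmd {
	int tag;
	struct cmd *next;	/* intrusive linkage, in the spirit of llist_node */
};

static struct cmd *resp_list;	/* head of the pending-response list */

static void queue_response(struct cmd *c)
{
	c->next = resp_list;	/* push to the head */
	resp_list = c;
}

int main(void)
{
	struct cmd delete_ns = { .tag = 0x1005 };
	int steps = 0;

	queue_response(&delete_ns);	/* queued once when the request completes */
	queue_response(&delete_ns);	/* queued again on the failure path */

	/* delete_ns.next now points at delete_ns itself: a consumer walking
	 * the list to send responses would spin on the same entry forever.
	 * Cap the walk at 5 iterations to show the cycle. */
	for (struct cmd *c = resp_list; c && steps < 5; c = c->next, steps++)
		printf("sending response for tag 0x%x\n", (unsigned)c->tag);
	printf("stopped after %d iterations (list is cyclic)\n", steps);
	return 0;
}
--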
Created attachment 299799 [details]
proposed fix

Could you try the attached patch and see if that fixes the issue?
(In reply to Keith Busch from comment #5)
> Created attachment 299799 [details]
> proposed fix
>
> Could you try the attached patch and see if that fixes the issue?

I will try. Please give me a day or two, as I have unrelated system issues at the moment.
On 11/29/21 8:01 PM, Keith Busch wrote:
> [adding Sagi & linux-nvme mailing list]
>
> On Sat, Nov 27, 2021 at 11:24:57PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215157
>
> <snip>
>
>> [10678.676734] nvmet_tcp: failed cmd 00000000fdc07f80 id 4100 opcode 13, data_len: 0
>> [10684.129665] nvme nvme2: queue 0: timeout request 0x0 type 4
>
> The target sees the opcode 13 (0xd) for the delete namespace management
> command and correctly decides it is unsupported, so the request
> initialization fails. Here's where it looks like the target code goes
> wrong:
>
> nvmet_tcp_try_recv_one
>  nvmet_tcp_done_recv_pdu
>   nvmet_req_init
>    __nvmet_req_complete
>     nvmet_tcp_queue_response <- cmd is added to the queue resp_list
>   nvmet_tcp_handle_req_failure() -> rcv_state set to NVMET_TCP_RECV_DATA,
>                                     flags set to NVMET_TCP_F_INIT_FAILED

nvmet_tcp_handle_req_failure will not do anything if the command does not
have any in-capsule data (which is the case here).

>  nvmet_tcp_try_recv_data
>   nvmet_tcp_execute_request <- sees NVMET_TCP_F_INIT_FAILED
>    nvmet_tcp_queue_response <- cmd is added to the queue resp_list a 2nd time

There is a subtle case here: if the command we are queueing back is the
current command being processed (which is the case when we want to fail a
command for which we still need to accept the in-capsule data), we don't
attach it to the resp_list.

I do agree this is convoluted and needs cleaner failure handling...

> That can't be right: the cmd shouldn't be added to the response list
> twice.
>
> Sagi, am I missing something here?
>
>> [10691.809468] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
>> [10691.809519] nvmet: ctrl 1 fatal error occurred!
>> [10756.832004] nvme nvme2: queue 0: timeout request 0x0 type 4
>> [10756.832137] nvme nvme2: Connect command failed, error wo/DNR bit: 881
>> [10756.832199] nvme nvme2: failed to connect queue: 0 ret=881
>> [10756.832222] nvme nvme2: Failed reconnect attempt 1
>> [10756.832225] nvme nvme2: Reconnecting in 10 seconds...
>
> <snip>

The issue here seems to be that the controller fails the command but it
still times out for the host, which suggests that the error response is
not actually queued.
>>> [10814.174841] Workqueue: events nvmet_tcp_release_queue_work [nvmet_tcp]
>>> [10814.174850] Call Trace:
>>> <snip hung-task trace and further reconnect failures>
>
> The issue here seems to be that the controller fails the command but it
> still times out for the host, which suggests that the error response is
> not actually queued.

This should resolve the issue:
--
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index cb6a473c3eaf..a91ae06abce0 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -922,7 +922,7 @@ static void nvmet_tcp_handle_req_failure(struct nvmet_tcp_queue *queue,
 	size_t data_len = le32_to_cpu(req->cmd->common.dptr.sgl.length);
 	int ret;
 
-	if (!nvme_is_write(cmd->req.cmd) ||
+	if (!nvme_is_write(cmd->req.cmd) || !data_len ||
 	    data_len > cmd->req.port->inline_data_size) {
 		nvmet_prepare_receive_pdu(queue);
 		return;
--
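To make the effect of the one-line change concrete, the snippet below evaluates the early-return check from the diff for the failing command in this report (opcode 13 = 0x0d, SGL length 0, taken from the nvmet_tcp log). For non-fabrics commands nvme_is_write() tests the opcode's low bit, and 0x0d has that bit set, so before the patch a zero-length failed command did not take the early return and was steered into the data-receive path; with the added !data_len check it takes the early return and the target simply prepares for the next PDU. This is a standalone illustration with stand-in values (the 8 KiB inline_data_size is only an assumption), not the kernel code.

--
/* Standalone illustration of the early-return condition changed by the patch.
 * Values for the failing command come from the log above (opcode 0x0d,
 * data_len 0); inline_data_size is an assumed example value. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

static bool is_write_opcode(unsigned char opcode)
{
	return opcode & 1;	/* low opcode bit marks a host-to-controller transfer */
}

static bool takes_early_return(unsigned char opcode, size_t data_len,
			       size_t inline_data_size, bool patched)
{
	/* pre-patch:  !nvme_is_write(cmd) || data_len > inline_data_size
	 * post-patch: !nvme_is_write(cmd) || !data_len || data_len > inline_data_size */
	if (!is_write_opcode(opcode))
		return true;
	if (patched && data_len == 0)
		return true;
	return data_len > inline_data_size;
}

int main(void)
{
	const unsigned char opcode = 0x0d;	/* Namespace Management (delete-ns) */
	const size_t data_len = 0;		/* "data_len: 0" in the nvmet_tcp log */
	const size_t inline_data_size = 8192;	/* assumption for the example */

	printf("pre-patch : early return = %d (command falls into the data-receive path)\n",
	       takes_early_return(opcode, data_len, inline_data_size, false));
	printf("post-patch: early return = %d (target just prepares for the next PDU)\n",
	       takes_early_return(opcode, data_len, inline_data_size, true));
	return 0;
}
--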
>
> This should resolve the issue:
> --
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index cb6a473c3eaf..a91ae06abce0 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -922,7 +922,7 @@ static void nvmet_tcp_handle_req_failure(struct nvmet_tcp_queue *queue,
>  	size_t data_len = le32_to_cpu(req->cmd->common.dptr.sgl.length);
>  	int ret;
>
> -	if (!nvme_is_write(cmd->req.cmd) ||
> +	if (!nvme_is_write(cmd->req.cmd) || !data_len ||
>  	    data_len > cmd->req.port->inline_data_size) {
>  		nvmet_prepare_receive_pdu(queue);
>  		return;
> --

Just tested this, and it is looking good.

# nvme delete-ns -n 1 /dev/nvme1
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x6001)

# dmesg:
[ 937.078812] nvmet_tcp: failed cmd 0000000059bc4046 id 4105 opcode 13, data_len: 0

I also verified `nvme create-ns`; it now correctly fails too and causes no other issues. After any of the tested commands, the device and the OS remain fully usable, with no strange behaviour. I have not tested extensively, but I will keep playing with it for a bit to see whether there are any regressions.