When I turn on my QNAP DL-D800C with 8 HD, the kernel 6.6.39 freezes immediately. kernels 6.6.38 and 6.6.37 are not doing that. I can reproduce with arch kernel linux-lts 6.6.39-1 and with linux66-tkg-eevdf-generic_v3 6.6.39-273.1 I identified commit 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a as the root cause: commit 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a Author: Niklas Neronin <niklas.neronin@linux.intel.com> Date: Mon Apr 29 17:02:37 2024 +0300 usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB [ Upstream commit 66cb618bf0bb82859875b00eeffaf223557cb416 ] Some transfer events don't always point to a TRB, and consequently don't have a endpoint ring. In these cases, function handle_tx_event() should not proceed, because if 'ep->skip' is set, the pointer to the endpoint ring is used. To prevent a potential failure and make the code logical, return after checking the completion code for a Transfer event without TRBs. Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20240429140245.3955523-11-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> When I revert that patch, the kernel does not freeze and everything is fine.
EDIT: USB case is Qnap NAS Storage TOWER 8BAY/TL-D800C https://www.qnap.com/de-de/product/tl-d800c
I can confirm that. My case: ID 152d:0578 JMicron Technology Corp. / JMicron USA Technology Corp. JMS578 SATA 6Gb/s Kernel freeze immediately after the case is plugged to an USB connector. Linux stable kernel (6.9.9) is not affected.
Hi, Good job finding the bad commit. I'm almost sure that I know what the problem is - it appears that this commit effectively disables incremeting event ring pointer under certain conditions, which causes the IRQ handler to process the same event again and again and never progress to the next one. I would try changing the return 0; added by the bad commit into inc_deq(xhci, ir->event_ring); return 0; and see if this helps. Other than that, the commit appears to be valid fix to a real problem that may happen on isochronous endpoints when MISSED_SERVICE_ERROR is immediately followed by STOPPED_LENGTH_INVALID, which I think could occur. The over/underrun conditions are perhaps possible too. Regards, Michal
Does your proposed solution explain why the bug does not show in 6.9.9 although it contains the same commit?
Related: https://lkml.org/lkml/2024/7/13/328
I created a patch file for 6.6.39 based on Michael's proposal (comment #3): --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2657,6 +2657,7 @@ slot_id, ep_index); goto err_out; } + inc_deq(xhci, ir->event_ring); return 0; } I can confirm that this prevents the freezing. The device and the attached hard discs work normal as far as I can tell. I did a fio benchmark run on one of the zfs pools in that QNAP case and it executed just fine with the usual results.
RESOLVED with 6.1.99 and 6.6.40 .