Bug 219039

Summary: kernel 6.6.39 freezes with QNAP TL-D800C usb case
Product: Drivers Reporter: Matthias (matthias)
Component: USBAssignee: Default virtual assignee for Drivers/USB (drivers_usb)
Status: RESOLVED CODE_FIX    
Severity: normal CC: carnil, elatllat, ZeroBeat
Priority: P3    
Hardware: AMD   
OS: Linux   
URL: https://github.com/torvalds/linux/commit/66cb618bf0bb82859875b00eeffaf223557cb416
Kernel Version: 6.6.39 Subsystem:
Regression: Yes Bisected commit-id: 66cb618bf0bb82859875b00eeffaf223557cb416

Description Matthias 2024-07-14 12:11:14 UTC
When I turn on my QNAP DL-D800C with 8 HD, the kernel 6.6.39 freezes immediately. kernels 6.6.38 and 6.6.37 are not doing that. 

I can reproduce with arch kernel linux-lts 6.6.39-1 and with linux66-tkg-eevdf-generic_v3 6.6.39-273.1

I identified commit 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a as the root cause:

commit 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a
Author: Niklas Neronin <niklas.neronin@linux.intel.com>
Date:   Mon Apr 29 17:02:37 2024 +0300

    usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB
    
    [ Upstream commit 66cb618bf0bb82859875b00eeffaf223557cb416 ]
    
    Some transfer events don't always point to a TRB, and consequently don't
    have a endpoint ring. In these cases, function handle_tx_event() should
    not proceed, because if 'ep->skip' is set, the pointer to the endpoint
    ring is used.
    
    To prevent a potential failure and make the code logical, return after
    checking the completion code for a Transfer event without TRBs.
    
    Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com>
    Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
    Link: https://lore.kernel.org/r/20240429140245.3955523-11-mathias.nyman@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>


When I revert that patch, the kernel does not freeze and everything is fine.
Comment 1 Matthias 2024-07-14 12:15:21 UTC
EDIT: 
USB case is
Qnap NAS Storage TOWER 8BAY/TL-D800C 
https://www.qnap.com/de-de/product/tl-d800c
Comment 2 Michael 2024-07-14 12:50:42 UTC
I can confirm that.

My case:
ID 152d:0578 JMicron Technology Corp. / JMicron USA Technology Corp. JMS578 SATA 6Gb/s

Kernel freeze immediately after the case is plugged to an USB connector.

Linux stable kernel (6.9.9) is not affected.
Comment 3 MichaƂ Pecio 2024-07-14 14:03:13 UTC
Hi,

Good job finding the bad commit. I'm almost sure that I know what the
problem is - it appears that this commit effectively disables
incremeting event ring pointer under certain conditions, which causes
the IRQ handler to process the same event again and again and never
progress to the next one.

I would try changing the

    return 0;

added by the bad commit into

    inc_deq(xhci, ir->event_ring);
    return 0;

and see if this helps.


Other than that, the commit appears to be valid fix to a real problem
that may happen on isochronous endpoints when MISSED_SERVICE_ERROR is
immediately followed by STOPPED_LENGTH_INVALID, which I think could
occur. The over/underrun conditions are perhaps possible too.

Regards,
Michal
Comment 4 Matthias 2024-07-14 15:50:18 UTC
Does your proposed solution explain why the bug does not show in 6.9.9 although it contains the same commit?
Comment 5 elatllat 2024-07-14 16:28:19 UTC
Related: https://lkml.org/lkml/2024/7/13/328
Comment 6 Matthias 2024-07-14 16:32:54 UTC
I created a patch file for 6.6.39 based on Michael's proposal (comment #3):

--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2657,6 +2657,7 @@
 				 slot_id, ep_index);
 			goto err_out;
 		}
+		inc_deq(xhci, ir->event_ring);
 		return 0;
 	}


I can confirm that this prevents the freezing. The device and the attached hard discs work normal as far as I can tell. I did a fio benchmark run on one of the zfs pools in that QNAP case and it executed just fine with the usual results.
Comment 7 elatllat 2024-07-15 11:53:14 UTC
 RESOLVED with 6.1.99 and 6.6.40 .