Bug 219039 - kernel 6.6.39 freezes with QNAP TL-D800C usb case
Summary: kernel 6.6.39 freezes with QNAP TL-D800C usb case
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: AMD Linux
: P3 normal
Assignee: Default virtual assignee for Drivers/USB
URL: https://github.com/torvalds/linux/com...
Keywords:
Depends on:
Blocks:
 
Reported: 2024-07-14 12:11 UTC by Matthias
Modified: 2024-07-16 07:53 UTC (History)
3 users (show)

See Also:
Kernel Version: 6.6.39
Subsystem:
Regression: Yes
Bisected commit-id: 66cb618bf0bb82859875b00eeffaf223557cb416


Attachments

Description Matthias 2024-07-14 12:11:14 UTC
When I turn on my QNAP DL-D800C with 8 HD, the kernel 6.6.39 freezes immediately. kernels 6.6.38 and 6.6.37 are not doing that. 

I can reproduce with arch kernel linux-lts 6.6.39-1 and with linux66-tkg-eevdf-generic_v3 6.6.39-273.1

I identified commit 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a as the root cause:

commit 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a
Author: Niklas Neronin <niklas.neronin@linux.intel.com>
Date:   Mon Apr 29 17:02:37 2024 +0300

    usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB
    
    [ Upstream commit 66cb618bf0bb82859875b00eeffaf223557cb416 ]
    
    Some transfer events don't always point to a TRB, and consequently don't
    have a endpoint ring. In these cases, function handle_tx_event() should
    not proceed, because if 'ep->skip' is set, the pointer to the endpoint
    ring is used.
    
    To prevent a potential failure and make the code logical, return after
    checking the completion code for a Transfer event without TRBs.
    
    Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com>
    Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
    Link: https://lore.kernel.org/r/20240429140245.3955523-11-mathias.nyman@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>


When I revert that patch, the kernel does not freeze and everything is fine.
Comment 1 Matthias 2024-07-14 12:15:21 UTC
EDIT: 
USB case is
Qnap NAS Storage TOWER 8BAY/TL-D800C 
https://www.qnap.com/de-de/product/tl-d800c
Comment 2 Michael 2024-07-14 12:50:42 UTC
I can confirm that.

My case:
ID 152d:0578 JMicron Technology Corp. / JMicron USA Technology Corp. JMS578 SATA 6Gb/s

Kernel freeze immediately after the case is plugged to an USB connector.

Linux stable kernel (6.9.9) is not affected.
Comment 3 Michał Pecio 2024-07-14 14:03:13 UTC
Hi,

Good job finding the bad commit. I'm almost sure that I know what the
problem is - it appears that this commit effectively disables
incremeting event ring pointer under certain conditions, which causes
the IRQ handler to process the same event again and again and never
progress to the next one.

I would try changing the

    return 0;

added by the bad commit into

    inc_deq(xhci, ir->event_ring);
    return 0;

and see if this helps.


Other than that, the commit appears to be valid fix to a real problem
that may happen on isochronous endpoints when MISSED_SERVICE_ERROR is
immediately followed by STOPPED_LENGTH_INVALID, which I think could
occur. The over/underrun conditions are perhaps possible too.

Regards,
Michal
Comment 4 Matthias 2024-07-14 15:50:18 UTC
Does your proposed solution explain why the bug does not show in 6.9.9 although it contains the same commit?
Comment 5 elatllat 2024-07-14 16:28:19 UTC
Related: https://lkml.org/lkml/2024/7/13/328
Comment 6 Matthias 2024-07-14 16:32:54 UTC
I created a patch file for 6.6.39 based on Michael's proposal (comment #3):

--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2657,6 +2657,7 @@
 				 slot_id, ep_index);
 			goto err_out;
 		}
+		inc_deq(xhci, ir->event_ring);
 		return 0;
 	}


I can confirm that this prevents the freezing. The device and the attached hard discs work normal as far as I can tell. I did a fio benchmark run on one of the zfs pools in that QNAP case and it executed just fine with the usual results.
Comment 7 elatllat 2024-07-15 11:53:14 UTC
 RESOLVED with 6.1.99 and 6.6.40 .

Note You need to log in before you can comment on or make changes to this bug.