Bug 207065

Summary: C-media USB audio device stops working from 5.2.0-rc3 onwards
Product: Drivers Reporter: Florian Meyer (florianmey)
Component: USBAssignee: Default virtual assignee for Drivers/USB (drivers_usb)
Status: NEW ---    
Severity: normal CC: hawk.it, jamielennox, leho, mail, mathias.nyman, miren_radia, o.freyermuth
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: >=5.2.0 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Output of lsusb -v
dmesg output with usb debug enabled
Content of /sys/kernel/debug/tracing/trace
testpatch that doesn't clear TT buffer after protocol STALL
Fix typo in attachment 288293
Dmesg output with both patches applied
Content of /sys/kernel/debug/tracing/trace with both patches applied
Full dmesg output with both patches applied
DMESG log from todays lockup.
Older dmesg log with debug information added
Content of /sys/kernel/debug/tracing/trace
Content of /sys/kernel/debug/tracing/trace

Description Florian Meyer 2020-04-02 15:57:49 UTC
Created attachment 288159 [details]
Output of lsusb -v

Hi.
I have an external USB audio device (Bus 003 Device 005: ID 0d8c:000c C-Media Electronics, Inc. Audio Adapter) / USB hub (Bus 003 Device 006: ID 058f:6254 Alcor Micro Corp. USB Hub). While the USB hub side of it is still working, the audio device fails from kernel version 5.2.0-rc3 onwards. 
The error is this:
speaker-test -c2 -D hw:1,0

speaker-test 1.2.2

Playback device is hw:1,0
Stream parameters are 48000Hz, S16_LE, 2 channels
Using 16 octaves of pink noise
Rate set to 48000Hz (requested 48000Hz)
Buffer size range from 96 to 262144
Period size range from 48 to 131072
Using max buffer size 262144
Periods = 4
Unable to set hw params for playback: Input/output error
Setting of hwparams failed: Input/output error

While dmesg shows: [ 1287.634587] usb 1-2.4: 1:1: usb_set_interface failed (-110)

I bisected it and got to the following commit as the culprit:
# first bad commit: [ef513be0a9057cc6baf5d29566aaaefa214ba344] usb: xhci: Add Clear_TT_Buffer
Last working one was:
good: [4998f1efd1904dd21697aeeead270e3eb97691dd] usb: Add devaddr in struct usb_device

Current head gives me the same error messages as the ones above, so the bug is still there.

I first added my report to a very similar bug (https://bugzilla.kernel.org/show_bug.cgi?id=203419) but the kernel version at which the problem started for me does not fit the original report there, which is why I am opening this as a new report.
Comment 1 Florian Meyer 2020-04-02 16:00:52 UTC
Created attachment 288161 [details]
dmesg output with usb debug enabled

As per Mathias Nyman's comment over at the other bug report I'm also attaching the requested debug logs from this:

mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
< connect the logitec device >
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace
Comment 2 Florian Meyer 2020-04-02 16:01:31 UTC
Created attachment 288163 [details]
Content of /sys/kernel/debug/tracing/trace
Comment 3 Florian Meyer 2020-04-08 10:08:03 UTC
There is also another bug which was bisected to the same commit under https://bugzilla.kernel.org/show_bug.cgi?id=206897
But that one is not related to USB audio devices.

Also there has been another bug related to this commit from last year, which was discussed and fixed here:
https://www.spinics.net/lists/linux-usb/msg182970.html
Comment 4 Mathias Nyman 2020-04-09 12:32:05 UTC
Thanks, traces show its related to Clearing TT buffer after a STALL on 
endpoint 0.

The first stall looks like a protocol stall, not a function stall, meaning that
endpoint isn't really halted, just that the device does not support the
request in the control transfer. 

Anyway, xhci starts clearing what it assumes is a halted endpoint, 
including clearing the hub TT buffer.

Specs are a bit unclear if TT should be cleared in this case, 
or at least I couldn't find it.

I'll create a tespatch that doesn't clear TT buffer on protocol stalls,
and add excessive tracing and debugging to it.
Comment 5 Mathias Nyman 2020-04-09 12:34:32 UTC
Created attachment 288293 [details]
testpatch that doesn't clear TT buffer after protocol STALL

Does this help?  Could you share similar logs and traces with this applied?
Thanks
Comment 6 Alan Stern 2020-04-09 14:01:41 UTC
On Thu, 9 Apr 2020 bugzilla-daemon@bugzilla.kernel.org wrote:

> --- Comment #4 from Mathias Nyman (mathias.nyman@linux.intel.com) ---
> Thanks, traces show its related to Clearing TT buffer after a STALL on 
> endpoint 0.
> 
> The first stall looks like a protocol stall, not a function stall, meaning
> that
> endpoint isn't really halted, just that the device does not support the
> request in the control transfer. 
> 
> Anyway, xhci starts clearing what it assumes is a halted endpoint, 
> including clearing the hub TT buffer.
> 
> Specs are a bit unclear if TT should be cleared in this case, 
> or at least I couldn't find it.

TTs should be cleared when an error occurs in the protocol.  STALL
isn't an error; a real error would be a timeout or CRC mismatch or URB
cancellation, things like that -- things that might leave the TT in a
busy state (because it hasn't sent its final status back to the host)
when it ought to be idle.  Sending a STALL isn't a protocol error and
it does clear the TT status.

This is discussed (not as explicitly as one might want) in sections 
11.17.3 and 11.17.5 of the USB-2.0 specification.

Alan Stern
Comment 7 Vincenzo Di Massa 2020-04-10 17:58:25 UTC
Created attachment 288331 [details]
Fix typo in attachment 288293 [details]

bug 206897 is resolved with the patch in attachment 288293 [details], there is a typo in the patch.
This patch just fixes the typo.

Apply in order

0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol.patch
0002-Fix-a-typo-in-prevoius-patch.patch
Comment 8 Mathias Nyman 2020-04-14 10:07:04 UTC
> TTs should be cleared when an error occurs in the protocol.  STALL
> isn't an error; a real error would be a timeout or CRC mismatch or URB
> cancellation, things like that -- things that might leave the TT in a
> busy state (because it hasn't sent its final status back to the host)
> when it ought to be idle.  Sending a STALL isn't a protocol error and
> it does clear the TT status.
> 
> This is discussed (not as explicitly as one might want) in sections 
> 11.17.3 and 11.17.5 of the USB-2.0 specification.
> 

Thanks for the clarification, makes sense.
Comment 9 Florian Meyer 2020-04-14 16:30:02 UTC
With both patches applied the sound system is working.
Thanks a lot!
I will attach the requested logs with the patch applied as well.
Comment 10 Florian Meyer 2020-04-14 16:30:44 UTC
Created attachment 288447 [details]
Dmesg output with both patches applied
Comment 11 Florian Meyer 2020-04-14 16:34:02 UTC
Created attachment 288449 [details]
Content of /sys/kernel/debug/tracing/trace with both patches applied

Gzipped because the file was too big to attach here.
Comment 12 Florian Meyer 2020-04-16 13:00:20 UTC
Created attachment 288523 [details]
Full dmesg output with both patches applied

I just realized that the last dmesg log was truncated for some reason. This one is complete.
Comment 13 Florian Meyer 2020-09-29 10:58:12 UTC
So I'm not entirely sure if this is related to my previous problem but I'm now encountering a different kind of USB problems. USB equipment is still the same as described in the first message. 
The new problem leads to all USB devices getting disconnected, USB audio as well as mouse. The USB keyboard continues working because I guess it is working via PS2 internally?
The problem has been occurring pretty much since the previous issue was fixed but only sporadically. Today it occurred for the first time in ~4 weeks and sometimes it happened several times in a day. 
The last messages in the dmesg log are:

[ 7075.685892] xhci_hcd 0000:2a:00.3: xHCI host not responding to stop endpoint command.
[ 7075.685895] xhci_hcd 0000:2a:00.3: USBSTS:
[ 7075.701768] xhci_hcd 0000:2a:00.3: xHCI host controller not responding, assume dead
[ 7075.701775] xhci_hcd 0000:2a:00.3: HC died; cleaning up
[ 7075.701808] usb 3-1: USB disconnect, device number 2
[ 7075.766108] usb 3-2: USB disconnect, device number 3
[ 7075.766110] usb 3-2.4: USB disconnect, device number 4
Comment 14 Florian Meyer 2020-09-29 10:59:15 UTC
Created attachment 292705 [details]
DMESG log from todays lockup.
Comment 15 Florian Meyer 2020-09-29 11:02:42 UTC
Created attachment 292707 [details]
Older dmesg log with debug information added

Log created as per the previous instructions (message 2). I just enabled the settings on system start in the hopes of capturing the hang so the files are quite big.
Comment 16 Florian Meyer 2020-09-29 11:50:44 UTC
Created attachment 292709 [details]
Content of /sys/kernel/debug/tracing/trace

Created as per previous comment. Uploaded at:
Comment 17 Florian Meyer 2020-09-29 11:53:09 UTC
Created attachment 292711 [details]
Content of /sys/kernel/debug/tracing/trace

Unpacked file is ~200mb. Ran for the entirety of the systems uptime to be able to catch the bug.