Bug 199323

Summary: DVBSky USB TV tuners do not work since 4.10 due to mutex issues
Product: Drivers Reporter: Olli Salonen (olli.salonen)
Component: Video(Other)Assignee: drivers_video-other
Status: RESOLVED CODE_FIX    
Severity: normal CC: berndkuhls, forum, forum, guido.aulisi, jahutchinson99, kernel, lbm, n_alp, o.freyermuth, olli.salonen, peterz, sergey79, stefan.bruens, wolfgang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.10 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Debug logging for I2C
media: dvbsky: use a single mutex and state buffers for all R/W ops
mutex fix
syslog before reboot
m88ds3103-serialize-reset-messages.patch

Description Olli Salonen 2018-04-08 17:36:40 UTC
When using kernel 4.10 or newer DVBSky S960 and S960CI satellite USB TV tuners do not work properly. Basically, they work initially but will fail at one point when tuning to a channel. This typically takes less than 100 tuning attempts. 

After the tuner fails, I need to reboot the PC (probably unloading the
driver and loading it again would do).

The tvheadend application complains about failing to send disecq commands, DVBv5 scan gives another error.

2018-04-04 10:17:36.756 [   INFO] mpegts: 12149H in 4.8E - tuning on
Montage Technology M88DS3103 : DVB-S #0
2018-04-04 10:17:37.159 [  ERROR] diseqc: failed to send diseqc cmd
(e=Connection timed out)
2018-04-04 10:17:37.160 [   INFO] mpegts: 12265H in 4.8E - tuning on
Montage Technology M88DS3103 : DVB-S #0
2018-04-04 10:17:37.535 [  ERROR] diseqc: failed to send diseqc cmd
(e=Connection timed out)

I did a kernel bisect between 4.9 and 4.10. It seems the commit that
breaks my tuner is the following one:

9d659ae14b545c4296e812c70493bfdc999b5c1c is the first bad commit
commit 9d659ae14b545c4296e812c70493bfdc999b5c1c
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Aug 23 14:40:16 2016 +0200

    locking/mutex: Add lock handoff to avoid starvation

I couldn't easily revert that commit only. I can see that the
drivers/media/usb/dvb-usb-v2/dvbsky.c driver does use mutex_lock() and
mutex_lock_interruptible() in a few places.
Comment 1 Olli Salonen 2018-04-18 05:12:11 UTC
The diseqc command failures are due to I2C read failures.

When the problem starts it can be seen that any I2C read returns "07 ff 49 00" or a subset of that depending on how many bytes are read.

In the log snippet below it can be seen that after the request:
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 03 11
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< 07

Any I2C response back starts with 07.

Apr 09 10:33:46 nucserver kernel: dvb_core:dvb_frontend_swzigzag_autotune: usb 1-1: dvb_frontend_swzigzag_autotune: drift:0 inversion:0 auto_step:2 auto_sub_step:0 started_auto_step:0
Apr 09 10:33:46 nucserver kernel: m88ds3103:m88ds3103_set_frontend: m88ds3103 4-0068: delivery_system=5 modulation=0 frequency=2119000 symbol_rate=2960000 inversion=2 pilot=2 rolloff=0
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 07 80
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< 08
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 09 01 01 68 3f
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< 08 ff
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 03 11
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< 07
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 09 01 01 60 3d
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< 07 ff
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 07 00
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< 07
Apr 09 10:33:46 nucserver kernel: ts2020:ts2020_set_params: frequency=2119500 offset=500 f_vco_khz=4238000 pll_n=2198 div_ref=14 div_out=2
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 03 11
Apr 09 10:33:46 nucserver kernel: dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< 07

I've now confirmed that I can get a 4.10 kernel with working DVBSky S960 by reverting the following 4 patches:

549bdd3 Revert "locking/mutex: Add lock handoff to avoid starvation"
3210f31 Revert "locking/mutex: Restructure wait loop"
418a170 Revert "locking/mutex: Simplify some ww_mutex code in
__mutex_lock_common()"
0b1fb8f Revert "locking/mutex: Enable optimistic spinning of woken waiter"
c470abd Linux 4.10
Comment 2 Olli Salonen 2018-04-18 05:12:42 UTC
Created attachment 275433 [details]
Debug logging for I2C
Comment 3 forum 2018-05-30 16:49:07 UTC
This bug is also present in kernel 4.9.35 and in 4.4.50.

I downgraded 4.9.35 to 4.4.50 with same error, after approx 20 tuning requests all subsequent tuning requests fail. Upgraded kernel to 4.9.13, error occurred but after many more tuning requests. For me the error occurs but downgrading does not reproduce the error.
Comment 4 Alexander Görtz 2018-06-23 14:46:41 UTC
This bug is still present in Kernel 4.17.2 and renders the widely spread DVBSky 960 v2 useless in Linux. I’m not sure if I’m skilled enough to revert the four patches and rebuild the kernel but I will try.
I hope this will be fixed in a future kernel release.
Comment 5 sergey79 2018-08-15 19:42:35 UTC
Probably the same issue here after upgrading from kernel 4.9.80 to 4.14.62. After the signal is lost (e.g. due to bad weather) and when it comes back - the tuner (DVBSky S960) does not lock the signal anymore. A reboot is required. I was able to reproduce this issue multiple times by blocking the signal from the LNB. Never had this issue with kernel 4.9.
Comment 6 James 2018-08-16 10:51:05 UTC
I also have this same problem with the DVBSky S960 USB adapter after upgrading to kernel v4.9.111. After a number of tuning requests the device locks up with timeout (-110) errors repeating in dmesg:

[934966.146448] m88ds3103 1-0068: i2c wr failed=-110
[934968.283100] usb 1-1: dvb_usb_v2: usb_bulk_msg() failed=-110
[934968.283107] usb 1-1: failed=-110

As reported by others, a reboot is required in order to clear the problem.
Comment 7 sergey79 2018-08-16 23:49:02 UTC
In kernel 4.18.0 dvbsky.c driver is completely broken - commit 7d95fb7 has to be reverted to make the card work.
Comment 8 James 2018-08-17 14:36:40 UTC
I tried backporting commit 7d95fb7 to a 4.9 kernel (just out of curiosity) and had the same problem as sergey79@gmail.com - my DVBSky S960 device did not work whatsoever!
Comment 9 Bernd Kuhls 2018-09-09 18:33:41 UTC
(In reply to sergey79 from comment #7)
> In kernel 4.18.0 dvbsky.c driver is completely broken - commit 7d95fb7 has
> to be reverted to make the card work.

Confirmed, same problem with 4.18.7, fixed by reverting 7d95fb7.
Comment 10 Oliver Freyermuth 2018-09-24 20:21:07 UTC
Same here, commit 7d95fb7 hardbreaks the card :-(.
Comment 11 Oliver Freyermuth 2018-10-02 21:18:25 UTC
I tried my luck on the mailing list, to no avail:
https://www.spinics.net/lists/linux-media/msg140839.html
https://www.spinics.net/lists/linux-media/msg141067.html
What's the way to contact the linux-media developers? 
It seems both bugzilla and mailing list are not monitored.
Comment 12 sergey79 2018-10-03 13:53:12 UTC
The tuning/mutex issue that was introduced with kernel 4.10 seem to be fixed in current kernel 4.19. Can anyone confirm? Commit 7d95fb7 still has to be reverted though.
Comment 13 James 2018-10-03 15:35:59 UTC
(In reply to Oliver Freyermuth from comment #11)
> I tried my luck on the mailing list, to no avail:
> https://www.spinics.net/lists/linux-media/msg140839.html
> https://www.spinics.net/lists/linux-media/msg141067.html
> What's the way to contact the linux-media developers? 
> It seems both bugzilla and mailing list are not monitored.

I found this thread, which is where the troublesome commit 7d95fb7 was worked on:
https://www.spinics.net/lists/linux-media/msg133995.html

Not sure whether replying to that thread would get this issue on the developers radar.
Comment 14 peterz 2018-10-03 15:56:28 UTC
On Wed, Oct 03, 2018 at 03:35:59PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:

> Not sure whether replying to that thread would get this issue on the
> developers
> radar.

Going by this one:

  https://www.spinics.net/lists/linux-media/msg133161.html

Since you all seem to be able to reproduce, could you try with something
like the below on top? It is unsafe, it can crash your kernel (if it
compiles at all of course), but if it runs it should give us some clues
as to who is holding the mutex and how long it takes to acquire it.

Output should end up in: /sys/kernel/debug/tracing/trace

---
diff --git a/drivers/media/usb/dvb-usb-v2/dvbsky.c b/drivers/media/usb/dvb-usb-v2/dvbsky.c
index 1aa88d94e57f..ccbd80c1ae0b 100644
--- a/drivers/media/usb/dvb-usb-v2/dvbsky.c
+++ b/drivers/media/usb/dvb-usb-v2/dvbsky.c
@@ -51,7 +51,12 @@ static int dvbsky_usb_generic_rw(struct dvb_usb_device *d,
 	int ret;
 	struct dvbsky_state *state = d_to_priv(d);
 
-	mutex_lock(&d->usb_mutex);
+	if (!mutex_trylock(&d->usb_mutex)) {
+		struct task_struct *p = __mutex_owner(&d->usb_mutex);
+		trace_printk("usb_mutex owned by: %d/%s\n", p->pid, p->comm);
+		mutex_lock(&d->usb_mutex);
+		trace_printk("usb_mutex acquired\n");
+	}
 	if (wlen != 0)
 		memcpy(state->obuf, wbuf, wlen);
Comment 15 Oliver Freyermuth 2018-10-03 18:58:04 UTC
(In reply to peterz from comment #14)
> Since you all seem to be able to reproduce, could you try with something
> like the below on top? 

Sorry to ask before trying it out, but - on top of a "vanilla" kernel or after reverting commit 7d95fb7 ? 

If I read 7d95fb7 correctly, it removed the call to dvbsky_usb_generic_rw from the stream_ctrl code, and essentially keeps the usb_mutex locked for the time the stream_mutex was locked previously. 
I must even wonder if the issue caused by 7d95fb7 is a mutex issue, or whether the difference is that dvbsky_usb_generic_rw always wrote to the "global" state-buffers obuf and ibuf, which dvb_usbv2_generic_rw_locked naturally does not do anymore.
Comment 16 Oliver Freyermuth 2018-10-03 20:47:38 UTC
(In reply to peterz from comment #14)
> Since you all seem to be able to reproduce, could you try with something
> like the below on top? 
> 
> Output should end up in: /sys/kernel/debug/tracing/trace
> 

Ok, tested that patch against a "vanilla" kernel 4.18, i.e. with 7d95fb7 applied. 
Card does not tune, as is the case whenever 7d95fb7 is applied. Nothing in trace log, nothing in kernel log. Do I need to activate tracing somehow? 
Sorry for my ignorance in case that's obvious :-(.
Comment 17 peterz 2018-10-04 07:36:15 UTC
On Wed, Oct 03, 2018 at 08:47:38PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=199323
> 
> --- Comment #16 from Oliver Freyermuth (o.freyermuth@googlemail.com) ---
> (In reply to peterz from comment #14)
> > Since you all seem to be able to reproduce, could you try with something
> > like the below on top? 
> > 
> > Output should end up in: /sys/kernel/debug/tracing/trace
> > 
> 
> Ok, tested that patch against a "vanilla" kernel 4.18, i.e. with 7d95fb7
> applied. 
> Card does not tune, as is the case whenever 7d95fb7 is applied. Nothing in
> trace log, nothing in kernel log. Do I need to activate tracing somehow? 
> Sorry for my ignorance in case that's obvious :-(.

Hurm.. IIRC all you need is CONFIG_TRACING=y for trace_printk() to work.
So if you have the ftrace debugfs stuff, it _should_ all work.

Also, you should get a big 'scary' message in dmesg when trace_printk()
is active, that's usually a good sign that it works.

And yes, on top of vanilla, as we want to know what is preventing things
from working there.
Comment 18 Oliver Freyermuth 2018-10-04 14:37:33 UTC
(In reply to peterz from comment #17)
> Hurm.. IIRC all you need is CONFIG_TRACING=y for trace_printk() to work.
> So if you have the ftrace debugfs stuff, it _should_ all work.
> 
CONFIG_TRACING=y
CONFIG_TRACING_SUPPORT=y
are on, and /sys/kernel/debug/tracing/tracing_on contains "1". 
Sadly, no output, and no scary message in dmesg. 

> Also, you should get a big 'scary' message in dmesg when trace_printk()
> is active, that's usually a good sign that it works.
> 
> And yes, on top of vanilla, as we want to know what is preventing things
> from working there.

Let's presume it's something we won't see this way. For example, what about the changed timing? 
Before 7d95fb7, the usb-mutex was locked and unlocked twice in dvbsky_stream_ctrl via the call to dvbsky_usb_generic_rw, in addition to locking the stream-mutex. Now, it's locked only once. That should change the timing between the two calls, no? 
Since I have no experience with kernel programming, I don't have a feeling how much it will affect the timing, though. Maybe it's insignificant? 

Any other idea on how to debug it?
Comment 19 Oliver Freyermuth 2018-10-04 17:57:52 UTC
Argh! My bad. I did only "modules_install" and presumed it would depend on "modules", but it does not :-(.
Retrying again now, let's hope we get trace output now.
Comment 20 Oliver Freyermuth 2018-10-04 18:11:33 UTC
Ok, that did the trick. Scary message seen. 

This is what I got in trace:

#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
     kworker/1:1-67    [001] ....   963.283565: dvbsky_usb_generic_rw: usb_mutex owned by: 8585/kdvb-ad-0-fe-0
  kdvb-ad-0-fe-0-8585  [000] ....   963.287983: dvbsky_usb_generic_rw: usb_mutex owned by: 67/kworker/1:1
     kworker/1:1-67    [001] ....   963.287990: dvbsky_usb_generic_rw: usb_mutex acquired
  kdvb-ad-0-fe-0-8585  [000] ....   963.288234: dvbsky_usb_generic_rw: usb_mutex acquired
     kworker/1:1-67    [001] ....   963.595574: dvbsky_usb_generic_rw: usb_mutex owned by: 8585/kdvb-ad-0-fe-0
  kdvb-ad-0-fe-0-8585  [000] ....   963.600189: dvbsky_usb_generic_rw: usb_mutex owned by: 67/kworker/1:1
     kworker/1:1-67    [001] ....   963.600205: dvbsky_usb_generic_rw: usb_mutex acquired
  kdvb-ad-0-fe-0-8585  [000] ....   963.600367: dvbsky_usb_generic_rw: usb_mutex acquired
     kworker/1:1-67    [001] ....   963.907570: dvbsky_usb_generic_rw: usb_mutex owned by: 8585/kdvb-ad-0-fe-0
  kdvb-ad-0-fe-0-8585  [000] ....   963.912435: dvbsky_usb_generic_rw: usb_mutex owned by: 67/kworker/1:1
     kworker/1:1-67    [001] ....   963.912454: dvbsky_usb_generic_rw: usb_mutex acquired
  kdvb-ad-0-fe-0-8585  [000] ....   963.912690: dvbsky_usb_generic_rw: usb_mutex acquired
     kworker/1:1-67    [001] ....   964.523534: dvbsky_usb_generic_rw: usb_mutex owned by: 8604/kdvb-ad-0-fe-0
     kworker/1:1-67    [001] ....   964.524490: dvbsky_usb_generic_rw: usb_mutex acquired
  kdvb-ad-0-fe-0-8604  [005] ....   964.524499: dvbsky_usb_generic_rw: usb_mutex owned by: 67/kworker/1:1
  kdvb-ad-0-fe-0-8604  [005] ....   964.524635: dvbsky_usb_generic_rw: usb_mutex acquired
     kworker/1:1-67    [001] ....   987.930957: dvbsky_usb_generic_rw: usb_mutex owned by: 8584/w_scan
     kworker/1:1-67    [001] ....   987.931394: dvbsky_usb_generic_rw: usb_mutex acquired

This was when running a scan with w_scan. Does that help?
Comment 21 Oliver Freyermuth 2018-10-05 15:38:27 UTC
Mauro has just submitted the revert of 7d95fb7 upstream, including 4.18 stable. 

Apparently, this caused similar issues on RPi, even though in my case it was "standard" Intel x86_64 hardware with an xhci host controller. He presumed the failure might have been caused by changes in the USB subsystem, since the patch initially worked for him. 

I'll add links to the mail thread as soon as the mails made it to the archive at spinics later.
Comment 22 DocMAX 2018-10-05 16:17:57 UTC
That took long. I notified him on IRC some 2 months ago about this issue.
And if this is fix, there is still another older bug present which lets the device stop working after some hours:

https://forum.libreelec.tv/thread/12452-problems-with-dvbsky-technotrend-usb-dvb-s2-tuners-since-le8/

I can still confirm this bug.
Comment 23 sergey79 2018-10-05 16:58:03 UTC
(In reply to DocMAX from comment #22)
> there is still another older bug present which lets the
> device stop working after some hours

This seem to be fixed in kernel 4.19 (probably with latest mutex.c commits). I'm running 4.19-rc6 and I was unable to reproduce that issue (it was easy to reproduce with 4.14 and 4.18).
Comment 24 Oliver Freyermuth 2018-10-05 21:06:44 UTC
Here the promised link to the mail archives:
https://www.spinics.net/lists/linux-media/msg141418.html
Comment 25 sergey79 2018-10-08 17:43:36 UTC
(In reply to sergey79 from comment #23)
> This seem to be fixed in kernel 4.19 (probably with latest mutex.c commits).
> I'm running 4.19-rc6 and I was unable to reproduce that issue (it was easy
> to reproduce with 4.14 and 4.18).

I spoke too soon. Today it stopped working after 3 days. Going back to kernel 4.9...
Comment 26 James 2018-10-12 10:27:40 UTC
The debug tracing attached to this thread suggests that the device malfunctions shortly after receiving an extra {10} message in-between the {37 00 00}, {36 03 00} message pair:

dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 37 00 00
dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 10
dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< ff ff
dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 36 03 00
...
m88ds3103:m88ds3103_diseqc_send_master_cmd: m88ds3103 4-0068: diseqc tx timeout
m88ds3103:m88ds3103_diseqc_send_master_cmd: m88ds3103 4-0068: failed=-110

Looks to me like this is the bug Mauro was trying to resolve in 7d95fb7, i.e. by removing stream_mutex, and using only the usb_mutex inside dvbsky_stream_ctrl i.e. the same mutex used elsewhere in the driver, to ensure correct serialization.

However, as we know, this patch appears to render the device unusable for several users including myself.

I've done a couple of minor re-writes to 7d95fb7; most notably so it uses the state buffers rather than passing along a static variable.

So far, I've only managed to verify that the device works correctly with this patch applied. I performed several scans & channel tuning attempts and everything worked as normal. 

However, it can take several days/weeks before the -110 timeout errors occur on my x86_64 machine, so I would appreciate if others who are able to reproduce the issue more readily could test the patch attached to this thread.

Note: This should be applied after reverting 7d95fb7.
Comment 27 James 2018-10-12 10:30:17 UTC
Created attachment 279005 [details]
media: dvbsky: use a single mutex and state buffers for all R/W ops
Comment 28 sergey79 2018-10-12 23:41:37 UTC
(In reply to James from comment #27)
> Created attachment 279005 [details]
> media: dvbsky: use a single mutex and state buffers for all R/W ops

That patch didn't fix the issue for me (I also reported this on libreelec forum). The quickest way to reproduce this is to try to tune to transponders when there is no signal from LNB. Otherwise it could take days.
Comment 29 DocMAX 2018-11-01 20:47:16 UTC
So if there is no fix for the mutex thing, how can i fix the stop working problem?
If i plug out and in the usb cable the device works.
So is a "usb port reset" possible on command line whenever the device freezes?
Comment 30 DocMAX 2018-11-17 07:42:19 UTC
OK, finally, at least on Arch-Linux, it's fixed with kernel 4.19.2.
Comment 31 sergey79 2018-12-08 01:35:32 UTC
Created attachment 279899 [details]
mutex fix
Comment 32 sergey79 2018-12-08 01:37:54 UTC
Comment on attachment 279899 [details]
mutex fix

I found a solution for the mutex issue.
Comment 33 sergey79 2018-12-08 02:15:42 UTC
After applying that patch the random tuner failures are gone. Tested with DVBSky S960, kernel 4.19.7.
Comment 34 Guido Aulisi 2018-12-10 09:04:10 UTC
(In reply to sergey79 from comment #32)
> Comment on attachment 279899 [details]
> mutex fix
> 
> I found a solution for the mutex issue.

IMHO that patch isn't an acceptable solution, it disables lock handoff (commit 9d659ae14b545c4296e812c70493bfdc999b5c1c), which I think is not acceptable upstream.

It was a fast way to revert that commit (see [0]) and test if it's still causing [1].
I think using single mutex was the right way to patch all handoff issues, but it seems not to work here.

[0]: https://bugzilla.kernel.org/show_bug.cgi?id=195471#c27
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=195471
Comment 35 sergey79 2018-12-10 14:08:21 UTC
(In reply to Guido Aulisi from comment #34)
> IMHO that patch isn't an acceptable solution

I agree but it's better than nothing. At least I can use new kernels now.


> I think using single mutex was the right way to patch all handoff issues,
> but it seems not to work here.

That is right, the original "single mutex" commit that appeared in kernel 4.18 completely broke the driver. With the patch from James (media: dvbsky: use a single mutex and state buffers for all R/W ops) the driver works but it didn't fix the original mutex issue.
Comment 36 Bernd Kuhls 2018-12-22 22:46:13 UTC
Created attachment 280127 [details]
syslog before reboot

(In reply to sergey79 from comment #32)
> Comment on attachment 279899 [details]
> mutex fix
> 
> I found a solution for the mutex issue.

While this patch overall worked great during the last days the machine rebooted a couple of minutes ago, syslog is attached, kernel 4.19.12 x86_64.
Comment 37 sergey79 2018-12-23 13:58:21 UTC
(In reply to Bernd Kuhls from comment #36)
> While this patch overall worked great during the last days the machine
> rebooted a couple of minutes ago, syslog is attached, kernel 4.19.12 x86_64.

I'm using kernel 4.19 with this patch for 2 weeks, running VDR server 24/7. No issues so far.
Comment 38 James 2019-01-11 14:12:58 UTC
Created attachment 280407 [details]
m88ds3103-serialize-reset-messages.patch

I managed to find some time to run this issue to ground, and have developed the following patch which should resolve issue introduced by the following commit:

9d659ae: ("locking/mutex: Add lock handoff to avoid starvation")

Would be interested to know whether this patch gives others the same success it gives me, before I submit for upstream consideration.

Whilst debugging, I also spotted a race condition in the driver which has been around for a few years, but that's a separate issue for another day and another thread. In the meantime, this should hopefully get the driver as stable as before commit 9d659ae.
Comment 39 sergey79 2019-01-11 17:30:57 UTC
(In reply to James from comment #38)
> Created attachment 280407 [details]
> m88ds3103-serialize-reset-messages.patch

I tested your patch with kernel 4.19.14 and I can confirm that it fixed the issue!
Comment 40 Oliver Freyermuth 2019-01-13 20:32:25 UTC
(In reply to James from comment #38)
> Created attachment 280407 [details]
> m88ds3103-serialize-reset-messages.patch

Same here, while sometimes I even had failure on the first tune, it now seems back to rock stability. 
Tested with 4.20.1. 

Thanks a lot!
Comment 41 James 2019-01-14 08:32:25 UTC
(In reply to Oliver Freyermuth from comment #40)
> (In reply to James from comment #38)
> > Created attachment 280407 [details]
> > m88ds3103-serialize-reset-messages.patch
> 
> Same here, while sometimes I even had failure on the first tune, it now
> seems back to rock stability. 
> Tested with 4.20.1. 
> 
> Thanks a lot!

Thanks for the confirmation.

I have submitted the patch for the consideration of the upstream devs:
https://patchwork.kernel.org/patch/10761423/
Comment 42 Bernd Kuhls 2019-01-17 19:39:36 UTC
(In reply to James from comment #38)
> Created attachment 280407 [details]
> m88ds3103-serialize-reset-messages.patch

Tested with 4.20.1 and had no issues in the last four days.
Comment 43 Stefan Brüns 2019-01-19 23:30:06 UTC
(In reply to James from comment #26)
> The debug tracing attached to this thread suggests that the device
> malfunctions shortly after receiving an extra {10} message in-between the
> {37 00 00}, {36 03 00} message pair:
> 
> dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 37 00
> 00
> dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 10
> dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< ff ff
> dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 36 03
> 00

"10" is the remote control poll. Would be interesting if this still works after Un-reverting and using "options dvb_usb_dvbsky disable_rc=1" ...
Comment 44 James 2019-01-20 09:02:41 UTC
(In reply to Stefan Brüns from comment #43)
> (In reply to James from comment #26)
> > The debug tracing attached to this thread suggests that the device
> > malfunctions shortly after receiving an extra {10} message in-between the
> > {37 00 00}, {36 03 00} message pair:
> > 
> > dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 37 00
> > 00
> > dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 10
> > dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: <<< ff ff
> > dvb_usb_v2:dvb_usb_v2_generic_io: usb 1-1: dvb_usb_v2_generic_io: >>> 36 03
> > 00
> 
> "10" is the remote control poll. Would be interesting if this still works
> after Un-reverting and using "options dvb_usb_dvbsky disable_rc=1" ...

I had also spotted this and did indeed try setting disable_rc=1 which did not fix the issue.

The capture you are looking at relates to my earlier patch attempt to remove the stream_mutex, and use only the usb_mutex; which proved not to be the issue.

The real issue is described in the comments I included in the patch, where i also included a capture that is more accurately indicitive of the problem....

i.e.

In the following capture:
- thread #1325 is running m88ds3103_set_frontend
- thread #42 is running ts2020_stat_work

 a> [1325] usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 07 80
    [1325] usb 1-1: dvb_usb_v2_generic_io: <<< 08
    [42] usb 1-1: dvb_usb_v2_generic_io: >>> 09 01 01 68 3f
    [42] usb 1-1: dvb_usb_v2_generic_io: <<< 08 ff
    [42] usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 03 11
    [42] usb 1-1: dvb_usb_v2_generic_io: <<< 07
    [42] usb 1-1: dvb_usb_v2_generic_io: >>> 09 01 01 60 3d
    [42] usb 1-1: dvb_usb_v2_generic_io: <<< 07 ff
 b> [1325] usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 07 00
    [1325] usb 1-1: dvb_usb_v2_generic_io: <<< 07
    [42] usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 03 11
    [42] usb 1-1: dvb_usb_v2_generic_io: <<< 07
    [42] usb 1-1: dvb_usb_v2_generic_io: >>> 09 01 01 60 21
    [42] usb 1-1: dvb_usb_v2_generic_io: <<< 07 ff
    [42] usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 03 11
    [42] usb 1-1: dvb_usb_v2_generic_io: <<< 07
    [42] usb 1-1: dvb_usb_v2_generic_io: >>> 09 01 01 60 66
    [42] usb 1-1: dvb_usb_v2_generic_io: <<< 07 ff
    [1325] usb 1-1: dvb_usb_v2_generic_io: >>> 08 68 02 03 11
    [1325] usb 1-1: dvb_usb_v2_generic_io: <<< 07
    [1325] usb 1-1: dvb_usb_v2_generic_io: >>> 08 60 02 10 0b
    [1325] usb 1-1: dvb_usb_v2_generic_io: <<< 07 

Two i2c messages are sent to perform a reset in m88ds3103_set_frontend:

a. 0x07, 0x80
b. 0x07, 0x00

However, as shown in the capture, the regmap mutex is being handed over to another thread (ts2020_stat_work in this example) in between these two messages.

From here, the device responds to every i2c message with an 07 message, and will only return to normal operation following a power cycle.

I managed to capture the issue several times with similar debug traces to that above. They all followed this same pattern.

The ts2020_stat_work thread wakes up every 2000ms to gather signal strength stats from the device. From my debug captures i found that the problem occurs if that tread happens to wake up whilst another thread is in m88ds3103_set_frontend (per example above).

The patch resolves this by using regmap_multi_reg_write to group the two reset messages, ensuring both are processed before the regmap mutex/i2c gate is unlocked.