Bug 206403

Summary: surface3-spi (touchscreen driver): crash after two touch input
Product: Drivers Reporter: Tsuchiya Yuto (kitakar5525) (kitakar)
Component: Input DevicesAssignee: Jarkko Nikula (jarkko.nikula)
Status: CLOSED CODE_FIX    
Severity: normal CC: andy.shevchenko
Priority: P2    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.5.1 Subsystem:
Regression: No Bisected commit-id:
Attachments: surface3-spi: workaround: prevent driver from using DMA
reloading only spi_pxa2xx_platform module after the touch input crash causes NULL pointer dereference

Description Tsuchiya Yuto (kitakar5525) 2020-02-04 14:07:17 UTC
Created attachment 287105 [details]
surface3-spi: workaround: prevent driver from using DMA

Overview:
On kernels at least after 4.19, touch input is broken after suspend
(s2idle) then making touch input two times:

        kern  :err   : [  +0.203408] Surface3-spi spi-MSHW0037:00: SPI transfer timed out

On recent stable kernels at least after 5.1, touch input will crash after
making touch input two times:

        kern  :err   : [  +0.203592] Surface3-spi spi-MSHW0037:00: SPI transfer timed out
        kern  :err   : [  +0.000173] spi_master spi1: failed to transfer one message from queue

Steps to Reproduce:
On 4.19.y:
1) boot the system
2) go into s2idle
3) wakeup from s2idle
4) make the first touch input (by both finger or stylus)
5) try to make the second touch input

On 5.5.y:
1) boot the system
2) make the first touch input
3) try to make the second touch input

Actual Results:
Touch input will no longer usable with the aforementioned dmesg log.

Expected Results:
Touch input will still usable after the second touch.

Additional Information:
The touchscreen driver works as expected with PIO. Only broken with
DMA. So, I made a patch to prevent the driver from using DMA (I attached
the patch for reference).

Note that in some kernel config setups, the driver uses PIO by default.
In other kernel config setups, on the other hand, the driver uses DMA
by default.

I confirmed that at least the following distribution kernels with their
config are affected:
- Fedora (5.4.12-200.fc31.x86_64)
- Arch Linux (both linux (5.5.y) and linux-lts (4.19.y) kernels
- Ubuntu (5.3.0-26-generic)

So, questions are:
- Is the driver intended for use with DMA?
- If not, is it possible to make the driver usable with DMA?
Comment 1 Tsuchiya Yuto (kitakar5525) 2020-02-04 14:49:01 UTC
Created attachment 287107 [details]
reloading only spi_pxa2xx_platform module after the touch input crash causes NULL pointer dereference

By the way, interestingly, I can fix the touch input crash with DMA mode
by reloading only spi_pxa2xx_platform module after (and only after) the
crash (with causing kernel NULL pointer dereference, race at shutting
down the spi_pxa2xx_platform module? [1]).

        sudo modprobe -r spi_pxa2xx_platform
        sudo modprobe spi_pxa2xx_platform

When I reload spi_pxa2xx_platform module with surface3_spi like this
(after the touch input crash), I can avoid the kernel NULL pointer 
dereference, but the touch input crash is still reproducible after this
reload.

        sudo modprobe -r surface3_spi
        sudo modprobe -r spi_pxa2xx_platform
        sudo modprobe spi_pxa2xx_platform
        sudo modprobe surface3_spi

[1] Have a look at the work that is done so far: https://github.com/linux-surface/kernel/commit/66c840113c7f7c7ffcaf5e4809ed1074c550ec0b
     I attached the dmesg log here, too.
Comment 2 Andy Shevchenko 2020-02-05 13:52:55 UTC
Reassigned to Jarrko and downgrade priority (it doesn't prevent most of the system working). Jarkko, I have my patch in incubation here: https://github.com/andy-shev/linux/tree/topic/spi/reload, I think I almost nailed it down, need more time to investigate. But it is being developed in order to solve second issue (removal crash), and not main part.
Comment 3 Tsuchiya Yuto (kitakar5525) 2020-02-08 23:19:44 UTC
Changed title to describe more precise situation.
Comment 4 Tsuchiya Yuto (kitakar5525) 2020-05-29 13:36:31 UTC
From 930f5d85e3fbf1c47cede7319cc873b49963a895 Mon Sep 17 00:00:00 2001
From: "Tsuchiya Yuto" <kitakar@gmail.com>
Date: Fri, 29 May 2020 18:15:01 +0900
Subject: [PATCH] spi: pxa2xx: release/setup DMA on runtime_pm

On recent v4.19 series (confirmed on v4.19.125), surface3_spi driver
stops working after resuming from s2idle then making a touch input:

    kern  :err   : [  +0.203408] Surface3-spi spi-MSHW0037:00: SPI transfer timed out

On recent v5.4, v5.6, and v5.7-rc series (confirmed on v5.4.43, v5.6.15,
and v5.7-rc7), that driver stops working after putting (internal)
display off/on then making a touch input. (On earlier versions, it stopped
after making touch input two times, regardless of s2idle/display state)

    kern  :err   : [  +0.203592] Surface3-spi spi-MSHW0037:00: SPI transfer timed out
    kern  :err   : [  +0.000173] spi_master spi1: failed to transfer one message from queue

This issue happens only with DMA mode.

I found that release/setup DMA cycle helps to get that driver working
again with DMA mode but I'm not sure where is the proper place to do so.
So, for now, I put changes into runtime_suspend()/runtime_resume() because
those gets called on every display off/on.

This commit release/setup DMA on runtime_suspend()/runtime_resume() to
avoid surface3_spi driver crashing.

Note: Considering that this issue is reproducible by only display off/on,
the cause of this issue is not related to s2idle. I also tried disabling
runtime_pm by commenting out the following lines [drivers/spi/spi-pxa2xx.c]

		SET_RUNTIME_PM_OPS(pxa2xx_spi_runtime_suspend,
				pxa2xx_spi_runtime_resume, NULL)

but still reproducible. (surface3_spi doesn't use runtime_pm). So,
runtime_pm is also not relevant.

TODO: Can someone guess why this is needed for Surface 3 (or surface3_spi
driver) ? I don't even know where the root cause lies; who/where to ask?
maybe the cause is not in surface3_spi or spi-pxa2xx code.

Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>
---
 drivers/spi/spi-pxa2xx.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
index f6e87344a36c8..d8d03dc5ac9c0 100644
--- a/drivers/spi/spi-pxa2xx.c
+++ b/drivers/spi/spi-pxa2xx.c
@@ -1978,6 +1978,13 @@ static int pxa2xx_spi_runtime_suspend(struct device *dev)
 {
 	struct driver_data *drv_data = dev_get_drvdata(dev);
 
+	pr_info("DEBUG: pxa2xx_spi_runtime_suspend() called\n");
+
+	/* Release DMA.
+	 * Code is extracted from pxa2xx_spi_remove(). */
+	if (drv_data->controller_info->enable_dma)
+		pxa2xx_spi_dma_release(drv_data);
+
 	clk_disable_unprepare(drv_data->ssp->clk);
 	return 0;
 }
@@ -1987,6 +1994,23 @@ static int pxa2xx_spi_runtime_resume(struct device *dev)
 	struct driver_data *drv_data = dev_get_drvdata(dev);
 	int status;
 
+	pr_info("DEBUG: pxa2xx_spi_runtime_resume() called\n");
+
+	/* Setup DMA if requested.
+	 * Code is extracted from pxa2xx_spi_probe(). */
+	if (drv_data->controller_info) {
+		status = pxa2xx_spi_dma_setup(drv_data);
+		if (status) {
+			dev_warn(dev, "no DMA channels available, using PIO\n");
+			drv_data->controller_info = false;
+		} else {
+			drv_data->controller->can_dma = pxa2xx_spi_can_dma;
+			drv_data->controller->max_dma_len = MAX_DMA_LEN;
+			drv_data->controller->max_transfer_size =
+				pxa2xx_spi_max_dma_transfer_size;
+		}
+	}
+
 	status = clk_prepare_enable(drv_data->ssp->clk);
 	return status;
 }
-- 
2.26.2
Comment 5 Andy Shevchenko 2020-05-29 14:29:12 UTC
(In reply to Tsuchiya Yuto (kitakar5525) from comment #4)

...

> This issue happens only with DMA mode.
> 
> I found that release/setup DMA cycle helps to get that driver working
> again with DMA mode but I'm not sure where is the proper place to do so.
> So, for now, I put changes into runtime_suspend()/runtime_resume() because
> those gets called on every display off/on.
> 
> This commit release/setup DMA on runtime_suspend()/runtime_resume() to
> avoid surface3_spi driver crashing.

...

> TODO: Can someone guess why this is needed for Surface 3 (or surface3_spi
> driver) ? I don't even know where the root cause lies; who/where to ask?
> maybe the cause is not in surface3_spi or spi-pxa2xx code.

> +     pr_info("DEBUG: pxa2xx_spi_runtime_suspend() called\n");
> +
> +     /* Release DMA.
> +      * Code is extracted from pxa2xx_spi_remove(). */
> +     if (drv_data->controller_info->enable_dma)
> +             pxa2xx_spi_dma_release(drv_data);
> +
>       clk_disable_unprepare(drv_data->ssp->clk);

Ah, looks that's again related to the wrong state machine implementation / handling.

Actually doesn't simple call of dma_terminate_sync() help with this? (we have function called pxa2xx_spi_dma_stop() for that)
Comment 6 Tsuchiya Yuto (kitakar5525) 2020-05-29 15:03:43 UTC
(In reply to Andy Shevchenko from comment #5)
> (In reply to Tsuchiya Yuto (kitakar5525) from comment #4)
> 
> ...
> 
> > This issue happens only with DMA mode.
> > 
> > I found that release/setup DMA cycle helps to get that driver working
> > again with DMA mode but I'm not sure where is the proper place to do so.
> > So, for now, I put changes into runtime_suspend()/runtime_resume() because
> > those gets called on every display off/on.

Oh, what I've said is incorrect. runtime_suspend()/runtime_resume() gets
called rather on every touch input stop/start.

> > This commit release/setup DMA on runtime_suspend()/runtime_resume() to
> > avoid surface3_spi driver crashing.
> 
> ...
> 
> > TODO: Can someone guess why this is needed for Surface 3 (or surface3_spi
> > driver) ? I don't even know where the root cause lies; who/where to ask?
> > maybe the cause is not in surface3_spi or spi-pxa2xx code.
> 
> > +     pr_info("DEBUG: pxa2xx_spi_runtime_suspend() called\n");
> > +
> > +     /* Release DMA.
> > +      * Code is extracted from pxa2xx_spi_remove(). */
> > +     if (drv_data->controller_info->enable_dma)
> > +             pxa2xx_spi_dma_release(drv_data);
> > +
> >       clk_disable_unprepare(drv_data->ssp->clk);
> 
> Ah, looks that's again related to the wrong state machine implementation /
> handling.
> 
> Actually doesn't simple call of dma_terminate_sync() help with this? (we
> have function called pxa2xx_spi_dma_stop() for that)

Like this?

diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
index 73d2a65d0b6ef..c8d305cc9a373 100644
--- a/drivers/spi/spi-pxa2xx.c
+++ b/drivers/spi/spi-pxa2xx.c
@@ -1977,6 +1977,11 @@ static int pxa2xx_spi_runtime_suspend(struct device *dev)
 {
 	struct driver_data *drv_data = dev_get_drvdata(dev);
 
+	/* Stop the DMA if running.
+	 * Code is extracted from pxa2xx_spi_handle_err(). */
+	if (atomic_read(&drv_data->dma_running))
+		pxa2xx_spi_dma_stop(drv_data);
+
 	clk_disable_unprepare(drv_data->ssp->clk);
 	return 0;
 }

but this didn't work. It seems that we really need to release DMA
(dma_release_channel()) and setup DMA again.
Comment 7 Tsuchiya Yuto (kitakar5525) 2020-05-29 15:16:39 UTC
(In reply to Tsuchiya Yuto (kitakar5525) from comment #6)
> (In reply to Andy Shevchenko from comment #5)
> > (In reply to Tsuchiya Yuto (kitakar5525) from comment #4)
> > > TODO: Can someone guess why this is needed for Surface 3 (or surface3_spi
> > > driver) ? I don't even know where the root cause lies; who/where to ask?
> > > maybe the cause is not in surface3_spi or spi-pxa2xx code.
> > 
> > > +     pr_info("DEBUG: pxa2xx_spi_runtime_suspend() called\n");
> > > +
> > > +     /* Release DMA.
> > > +      * Code is extracted from pxa2xx_spi_remove(). */
> > > +     if (drv_data->controller_info->enable_dma)
> > > +             pxa2xx_spi_dma_release(drv_data);
> > > +
> > >       clk_disable_unprepare(drv_data->ssp->clk);
> > 
> > Ah, looks that's again related to the wrong state machine implementation /
> > handling.
> > 
> > Actually doesn't simple call of dma_terminate_sync() help with this? (we
> > have function called pxa2xx_spi_dma_stop() for that)
> 
> Like this?
> 
> diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
> index 73d2a65d0b6ef..c8d305cc9a373 100644
> --- a/drivers/spi/spi-pxa2xx.c
> +++ b/drivers/spi/spi-pxa2xx.c
> @@ -1977,6 +1977,11 @@ static int pxa2xx_spi_runtime_suspend(struct device
> *dev)
>  {
>       struct driver_data *drv_data = dev_get_drvdata(dev);
>  
> +     /* Stop the DMA if running.
> +      * Code is extracted from pxa2xx_spi_handle_err(). */
> +     if (atomic_read(&drv_data->dma_running))
> +             pxa2xx_spi_dma_stop(drv_data);
> +
>       clk_disable_unprepare(drv_data->ssp->clk);
>       return 0;
>  }
> 
> but this didn't work. It seems that we really need to release DMA
> (dma_release_channel()) and setup DMA again.

Ah, that pxa2xx_spi_dma_stop() didn't get called. So, I tried the following

diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
index 73d2a65d0b6ef..a2b848b92b3ae 100644
--- a/drivers/spi/spi-pxa2xx.c
+++ b/drivers/spi/spi-pxa2xx.c
@@ -1977,6 +1977,9 @@ static int pxa2xx_spi_runtime_suspend(struct device *dev)
 {
 	struct driver_data *drv_data = dev_get_drvdata(dev);
 
+	pr_info("DEBUG: before pxa2xx_spi_dma_stop()\n");
+	pxa2xx_spi_dma_stop(drv_data);
+
 	clk_disable_unprepare(drv_data->ssp->clk);
 	return 0;
 }

but didn't work

@@ -1986,6 +1989,9 @@ static int pxa2xx_spi_runtime_resume(struct device *dev)
 	struct driver_data *drv_data = dev_get_drvdata(dev);
 	int status;
 
+	pr_info("DEBUG: before pxa2xx_spi_dma_start()\n");
+	pxa2xx_spi_dma_start(drv_data);
+
 	status = clk_prepare_enable(drv_data->ssp->clk);
 	return status;
 }

even if I add pxa2xx_spi_dma_start(), didn't work either.
Comment 8 Andy Shevchenko 2020-07-04 21:14:32 UTC
This gives me another clue. It may be that DMA controller lost his context and thus get's uninitialized / off.

Can you try two things (one-by-one and together) in drivers/dma/dw/core.c:

1/ comment out in  dwc_initialize()
        ... 
	if (test_bit(DW_DMA_IS_INITIALIZED, &dwc->flags))
		return;
        ...
2/ Append
  do_dw_dma_on(dw);
just before above mentioned lines.
Comment 9 Tsuchiya Yuto (kitakar5525) 2020-07-05 11:24:07 UTC
First, I built 5.8-rc3 kernel with the following config so that I can
rebuild only dw_dmac_core module:

    CONFIG_SERIAL_8250=m
    CONFIG_SERIAL_8250_LPSS=m
    CONFIG_DW_DMAC=m
    CONFIG_DW_DMAC_PCI=m

Then I confirmed that touch input still stops after display off when using
DMA mode with the kernel I built.

(In reply to Andy Shevchenko from comment #8)
> This gives me another clue. It may be that DMA controller lost his context
> and thus get's uninitialized / off.
> 
> Can you try two things (one-by-one and together) in drivers/dma/dw/core.c:
> 
> 1/ comment out in  dwc_initialize()
>         ... 
>       if (test_bit(DW_DMA_IS_INITIALIZED, &dwc->flags))
>               return;
>         ...
> 2/ Append
>   do_dw_dma_on(dw);
> just before above mentioned lines.

A) only 1/

Touch input still working with DMA mode after display_off/s2idle

B) only 2/

Touch input stopped working after display_off/s2idle

C) both 1/ and 2/

	[...]
	do_dw_dma_on(dw);

	// if (test_bit(DW_DMA_IS_INITIALIZED, &dwc->flags))
	// 	return;
	[...]

Touch input still working with DMA mode after display_off/s2idle

Thank you.
Comment 10 Andy Shevchenko 2020-07-09 15:32:43 UTC
https://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git/commit/?id=99ba8b9b0d9780e9937eb1d488d120e9e5c2533d

AFAIU this should fix the issue. If not, feel free to reopen.
Comment 11 Tsuchiya Yuto (kitakar5525) 2020-07-11 10:09:37 UTC
Yes, I confirmed that the patch works for v5.8-rc3, also for v5.4.51 and
v4.19.132 (with resolving small conflict for v4.19).

I no longer see surface3-spi driver crashing after display_off/s2idle.

Closing.

Thanks!
Comment 12 Tsuchiya Yuto (kitakar5525) 2020-07-14 13:03:00 UTC
Just for your info, I found a similar commit in intel-aero/aero-4.4.x
kernel tree:

https://github.com/intel-aero/linux-kernel/commit/3f5fcff79779310cea9da21b3ad808334854f5c6
>    dma: dw: must initialize dma channel in each transaction
>
>    CHT enable Autonomous D3 for LPSS DMA. When idle, DMA will power gated
>    and reset without driver knowledge, So need to initialize in each dma
>    transaction.

The commit message describes exactly the same situation as observed here.
Comment 13 Andy Shevchenko 2020-07-14 13:11:36 UTC
(In reply to Tsuchiya Yuto (kitakar5525) from comment #12)
> Just for your info, I found a similar commit in intel-aero/aero-4.4.x
> kernel tree:
> 
> https://github.com/intel-aero/linux-kernel/commit/
> 3f5fcff79779310cea9da21b3ad808334854f5c6

Thanks, I was not aware that we have already something like this.