Created attachment 287105 [details] surface3-spi: workaround: prevent driver from using DMA Overview: On kernels at least after 4.19, touch input is broken after suspend (s2idle) then making touch input two times: kern :err : [ +0.203408] Surface3-spi spi-MSHW0037:00: SPI transfer timed out On recent stable kernels at least after 5.1, touch input will crash after making touch input two times: kern :err : [ +0.203592] Surface3-spi spi-MSHW0037:00: SPI transfer timed out kern :err : [ +0.000173] spi_master spi1: failed to transfer one message from queue Steps to Reproduce: On 4.19.y: 1) boot the system 2) go into s2idle 3) wakeup from s2idle 4) make the first touch input (by both finger or stylus) 5) try to make the second touch input On 5.5.y: 1) boot the system 2) make the first touch input 3) try to make the second touch input Actual Results: Touch input will no longer usable with the aforementioned dmesg log. Expected Results: Touch input will still usable after the second touch. Additional Information: The touchscreen driver works as expected with PIO. Only broken with DMA. So, I made a patch to prevent the driver from using DMA (I attached the patch for reference). Note that in some kernel config setups, the driver uses PIO by default. In other kernel config setups, on the other hand, the driver uses DMA by default. I confirmed that at least the following distribution kernels with their config are affected: - Fedora (5.4.12-200.fc31.x86_64) - Arch Linux (both linux (5.5.y) and linux-lts (4.19.y) kernels - Ubuntu (5.3.0-26-generic) So, questions are: - Is the driver intended for use with DMA? - If not, is it possible to make the driver usable with DMA?
Created attachment 287107 [details] reloading only spi_pxa2xx_platform module after the touch input crash causes NULL pointer dereference By the way, interestingly, I can fix the touch input crash with DMA mode by reloading only spi_pxa2xx_platform module after (and only after) the crash (with causing kernel NULL pointer dereference, race at shutting down the spi_pxa2xx_platform module? [1]). sudo modprobe -r spi_pxa2xx_platform sudo modprobe spi_pxa2xx_platform When I reload spi_pxa2xx_platform module with surface3_spi like this (after the touch input crash), I can avoid the kernel NULL pointer dereference, but the touch input crash is still reproducible after this reload. sudo modprobe -r surface3_spi sudo modprobe -r spi_pxa2xx_platform sudo modprobe spi_pxa2xx_platform sudo modprobe surface3_spi [1] Have a look at the work that is done so far: https://github.com/linux-surface/kernel/commit/66c840113c7f7c7ffcaf5e4809ed1074c550ec0b I attached the dmesg log here, too.
Reassigned to Jarrko and downgrade priority (it doesn't prevent most of the system working). Jarkko, I have my patch in incubation here: https://github.com/andy-shev/linux/tree/topic/spi/reload, I think I almost nailed it down, need more time to investigate. But it is being developed in order to solve second issue (removal crash), and not main part.
Changed title to describe more precise situation.
From 930f5d85e3fbf1c47cede7319cc873b49963a895 Mon Sep 17 00:00:00 2001 From: "Tsuchiya Yuto" <kitakar@gmail.com> Date: Fri, 29 May 2020 18:15:01 +0900 Subject: [PATCH] spi: pxa2xx: release/setup DMA on runtime_pm On recent v4.19 series (confirmed on v4.19.125), surface3_spi driver stops working after resuming from s2idle then making a touch input: kern :err : [ +0.203408] Surface3-spi spi-MSHW0037:00: SPI transfer timed out On recent v5.4, v5.6, and v5.7-rc series (confirmed on v5.4.43, v5.6.15, and v5.7-rc7), that driver stops working after putting (internal) display off/on then making a touch input. (On earlier versions, it stopped after making touch input two times, regardless of s2idle/display state) kern :err : [ +0.203592] Surface3-spi spi-MSHW0037:00: SPI transfer timed out kern :err : [ +0.000173] spi_master spi1: failed to transfer one message from queue This issue happens only with DMA mode. I found that release/setup DMA cycle helps to get that driver working again with DMA mode but I'm not sure where is the proper place to do so. So, for now, I put changes into runtime_suspend()/runtime_resume() because those gets called on every display off/on. This commit release/setup DMA on runtime_suspend()/runtime_resume() to avoid surface3_spi driver crashing. Note: Considering that this issue is reproducible by only display off/on, the cause of this issue is not related to s2idle. I also tried disabling runtime_pm by commenting out the following lines [drivers/spi/spi-pxa2xx.c] SET_RUNTIME_PM_OPS(pxa2xx_spi_runtime_suspend, pxa2xx_spi_runtime_resume, NULL) but still reproducible. (surface3_spi doesn't use runtime_pm). So, runtime_pm is also not relevant. TODO: Can someone guess why this is needed for Surface 3 (or surface3_spi driver) ? I don't even know where the root cause lies; who/where to ask? maybe the cause is not in surface3_spi or spi-pxa2xx code. Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com> --- drivers/spi/spi-pxa2xx.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c index f6e87344a36c8..d8d03dc5ac9c0 100644 --- a/drivers/spi/spi-pxa2xx.c +++ b/drivers/spi/spi-pxa2xx.c @@ -1978,6 +1978,13 @@ static int pxa2xx_spi_runtime_suspend(struct device *dev) { struct driver_data *drv_data = dev_get_drvdata(dev); + pr_info("DEBUG: pxa2xx_spi_runtime_suspend() called\n"); + + /* Release DMA. + * Code is extracted from pxa2xx_spi_remove(). */ + if (drv_data->controller_info->enable_dma) + pxa2xx_spi_dma_release(drv_data); + clk_disable_unprepare(drv_data->ssp->clk); return 0; } @@ -1987,6 +1994,23 @@ static int pxa2xx_spi_runtime_resume(struct device *dev) struct driver_data *drv_data = dev_get_drvdata(dev); int status; + pr_info("DEBUG: pxa2xx_spi_runtime_resume() called\n"); + + /* Setup DMA if requested. + * Code is extracted from pxa2xx_spi_probe(). */ + if (drv_data->controller_info) { + status = pxa2xx_spi_dma_setup(drv_data); + if (status) { + dev_warn(dev, "no DMA channels available, using PIO\n"); + drv_data->controller_info = false; + } else { + drv_data->controller->can_dma = pxa2xx_spi_can_dma; + drv_data->controller->max_dma_len = MAX_DMA_LEN; + drv_data->controller->max_transfer_size = + pxa2xx_spi_max_dma_transfer_size; + } + } + status = clk_prepare_enable(drv_data->ssp->clk); return status; } -- 2.26.2
(In reply to Tsuchiya Yuto (kitakar5525) from comment #4) ... > This issue happens only with DMA mode. > > I found that release/setup DMA cycle helps to get that driver working > again with DMA mode but I'm not sure where is the proper place to do so. > So, for now, I put changes into runtime_suspend()/runtime_resume() because > those gets called on every display off/on. > > This commit release/setup DMA on runtime_suspend()/runtime_resume() to > avoid surface3_spi driver crashing. ... > TODO: Can someone guess why this is needed for Surface 3 (or surface3_spi > driver) ? I don't even know where the root cause lies; who/where to ask? > maybe the cause is not in surface3_spi or spi-pxa2xx code. > + pr_info("DEBUG: pxa2xx_spi_runtime_suspend() called\n"); > + > + /* Release DMA. > + * Code is extracted from pxa2xx_spi_remove(). */ > + if (drv_data->controller_info->enable_dma) > + pxa2xx_spi_dma_release(drv_data); > + > clk_disable_unprepare(drv_data->ssp->clk); Ah, looks that's again related to the wrong state machine implementation / handling. Actually doesn't simple call of dma_terminate_sync() help with this? (we have function called pxa2xx_spi_dma_stop() for that)
(In reply to Andy Shevchenko from comment #5) > (In reply to Tsuchiya Yuto (kitakar5525) from comment #4) > > ... > > > This issue happens only with DMA mode. > > > > I found that release/setup DMA cycle helps to get that driver working > > again with DMA mode but I'm not sure where is the proper place to do so. > > So, for now, I put changes into runtime_suspend()/runtime_resume() because > > those gets called on every display off/on. Oh, what I've said is incorrect. runtime_suspend()/runtime_resume() gets called rather on every touch input stop/start. > > This commit release/setup DMA on runtime_suspend()/runtime_resume() to > > avoid surface3_spi driver crashing. > > ... > > > TODO: Can someone guess why this is needed for Surface 3 (or surface3_spi > > driver) ? I don't even know where the root cause lies; who/where to ask? > > maybe the cause is not in surface3_spi or spi-pxa2xx code. > > > + pr_info("DEBUG: pxa2xx_spi_runtime_suspend() called\n"); > > + > > + /* Release DMA. > > + * Code is extracted from pxa2xx_spi_remove(). */ > > + if (drv_data->controller_info->enable_dma) > > + pxa2xx_spi_dma_release(drv_data); > > + > > clk_disable_unprepare(drv_data->ssp->clk); > > Ah, looks that's again related to the wrong state machine implementation / > handling. > > Actually doesn't simple call of dma_terminate_sync() help with this? (we > have function called pxa2xx_spi_dma_stop() for that) Like this? diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c index 73d2a65d0b6ef..c8d305cc9a373 100644 --- a/drivers/spi/spi-pxa2xx.c +++ b/drivers/spi/spi-pxa2xx.c @@ -1977,6 +1977,11 @@ static int pxa2xx_spi_runtime_suspend(struct device *dev) { struct driver_data *drv_data = dev_get_drvdata(dev); + /* Stop the DMA if running. + * Code is extracted from pxa2xx_spi_handle_err(). */ + if (atomic_read(&drv_data->dma_running)) + pxa2xx_spi_dma_stop(drv_data); + clk_disable_unprepare(drv_data->ssp->clk); return 0; } but this didn't work. It seems that we really need to release DMA (dma_release_channel()) and setup DMA again.
(In reply to Tsuchiya Yuto (kitakar5525) from comment #6) > (In reply to Andy Shevchenko from comment #5) > > (In reply to Tsuchiya Yuto (kitakar5525) from comment #4) > > > TODO: Can someone guess why this is needed for Surface 3 (or surface3_spi > > > driver) ? I don't even know where the root cause lies; who/where to ask? > > > maybe the cause is not in surface3_spi or spi-pxa2xx code. > > > > > + pr_info("DEBUG: pxa2xx_spi_runtime_suspend() called\n"); > > > + > > > + /* Release DMA. > > > + * Code is extracted from pxa2xx_spi_remove(). */ > > > + if (drv_data->controller_info->enable_dma) > > > + pxa2xx_spi_dma_release(drv_data); > > > + > > > clk_disable_unprepare(drv_data->ssp->clk); > > > > Ah, looks that's again related to the wrong state machine implementation / > > handling. > > > > Actually doesn't simple call of dma_terminate_sync() help with this? (we > > have function called pxa2xx_spi_dma_stop() for that) > > Like this? > > diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c > index 73d2a65d0b6ef..c8d305cc9a373 100644 > --- a/drivers/spi/spi-pxa2xx.c > +++ b/drivers/spi/spi-pxa2xx.c > @@ -1977,6 +1977,11 @@ static int pxa2xx_spi_runtime_suspend(struct device > *dev) > { > struct driver_data *drv_data = dev_get_drvdata(dev); > > + /* Stop the DMA if running. > + * Code is extracted from pxa2xx_spi_handle_err(). */ > + if (atomic_read(&drv_data->dma_running)) > + pxa2xx_spi_dma_stop(drv_data); > + > clk_disable_unprepare(drv_data->ssp->clk); > return 0; > } > > but this didn't work. It seems that we really need to release DMA > (dma_release_channel()) and setup DMA again. Ah, that pxa2xx_spi_dma_stop() didn't get called. So, I tried the following diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c index 73d2a65d0b6ef..a2b848b92b3ae 100644 --- a/drivers/spi/spi-pxa2xx.c +++ b/drivers/spi/spi-pxa2xx.c @@ -1977,6 +1977,9 @@ static int pxa2xx_spi_runtime_suspend(struct device *dev) { struct driver_data *drv_data = dev_get_drvdata(dev); + pr_info("DEBUG: before pxa2xx_spi_dma_stop()\n"); + pxa2xx_spi_dma_stop(drv_data); + clk_disable_unprepare(drv_data->ssp->clk); return 0; } but didn't work @@ -1986,6 +1989,9 @@ static int pxa2xx_spi_runtime_resume(struct device *dev) struct driver_data *drv_data = dev_get_drvdata(dev); int status; + pr_info("DEBUG: before pxa2xx_spi_dma_start()\n"); + pxa2xx_spi_dma_start(drv_data); + status = clk_prepare_enable(drv_data->ssp->clk); return status; } even if I add pxa2xx_spi_dma_start(), didn't work either.
This gives me another clue. It may be that DMA controller lost his context and thus get's uninitialized / off. Can you try two things (one-by-one and together) in drivers/dma/dw/core.c: 1/ comment out in dwc_initialize() ... if (test_bit(DW_DMA_IS_INITIALIZED, &dwc->flags)) return; ... 2/ Append do_dw_dma_on(dw); just before above mentioned lines.
First, I built 5.8-rc3 kernel with the following config so that I can rebuild only dw_dmac_core module: CONFIG_SERIAL_8250=m CONFIG_SERIAL_8250_LPSS=m CONFIG_DW_DMAC=m CONFIG_DW_DMAC_PCI=m Then I confirmed that touch input still stops after display off when using DMA mode with the kernel I built. (In reply to Andy Shevchenko from comment #8) > This gives me another clue. It may be that DMA controller lost his context > and thus get's uninitialized / off. > > Can you try two things (one-by-one and together) in drivers/dma/dw/core.c: > > 1/ comment out in dwc_initialize() > ... > if (test_bit(DW_DMA_IS_INITIALIZED, &dwc->flags)) > return; > ... > 2/ Append > do_dw_dma_on(dw); > just before above mentioned lines. A) only 1/ Touch input still working with DMA mode after display_off/s2idle B) only 2/ Touch input stopped working after display_off/s2idle C) both 1/ and 2/ [...] do_dw_dma_on(dw); // if (test_bit(DW_DMA_IS_INITIALIZED, &dwc->flags)) // return; [...] Touch input still working with DMA mode after display_off/s2idle Thank you.
https://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git/commit/?id=99ba8b9b0d9780e9937eb1d488d120e9e5c2533d AFAIU this should fix the issue. If not, feel free to reopen.
Yes, I confirmed that the patch works for v5.8-rc3, also for v5.4.51 and v4.19.132 (with resolving small conflict for v4.19). I no longer see surface3-spi driver crashing after display_off/s2idle. Closing. Thanks!
Just for your info, I found a similar commit in intel-aero/aero-4.4.x kernel tree: https://github.com/intel-aero/linux-kernel/commit/3f5fcff79779310cea9da21b3ad808334854f5c6 > dma: dw: must initialize dma channel in each transaction > > CHT enable Autonomous D3 for LPSS DMA. When idle, DMA will power gated > and reset without driver knowledge, So need to initialize in each dma > transaction. The commit message describes exactly the same situation as observed here.
(In reply to Tsuchiya Yuto (kitakar5525) from comment #12) > Just for your info, I found a similar commit in intel-aero/aero-4.4.x > kernel tree: > > https://github.com/intel-aero/linux-kernel/commit/ > 3f5fcff79779310cea9da21b3ad808334854f5c6 Thanks, I was not aware that we have already something like this.