Bug 218821 - RTS522A fails with "mmc: error -95 doing runtime resume" on Microsoft Surface Go 2
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: MMC/SD
Hardware: Intel Linux
Importance: P3 normal
Assignee: drivers_mmc-sd
 
Reported: 2024-05-08 18:42 UTC by Thomas Haschka
Modified: 2025-03-13 13:39 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel config with working mmc driver as of kernel 6.5.5 (161.60 KB, text/plain)
2024-09-22 13:33 UTC, Thomas Haschka
Kernel config with non-working mmc driver as of 6.6.52 (163.12 KB, text/plain)
2024-09-22 13:34 UTC, Thomas Haschka
Patch to revert changes from commit 0e4cac557531 [ Does not solve the problem ] (13.78 KB, patch)
2024-12-31 09:03 UTC, Thomas Haschka
Patch for 6.12.16 which solves the problem on the surface go 2 (2.74 KB, patch)
2025-02-21 14:55 UTC, Thomas Haschka
Patch to remove MMC_CAP_AGGRESSIVE_PM ( does not fix the problem ) (583 bytes, patch)
2025-03-13 11:01 UTC, Thomas Haschka

Description Thomas Haschka 2024-05-08 18:42:12 UTC
Updating from 6.5.5 to 6.8.9 makes my system frequently fail, with the following error showing up in dmesg:

On 6.8.9:
mmc0: error -95 doing runtime resume

causing the filesystem on the card to stop, after several read writes.

Reverting to 6.5.5 solves the issue.

This seems to be due to modifications in drivers/misc/cardreader, the RTSX_PCI driver.

The Microsoft Surface Go 2 card reader is detected as RTS522A.

Even on 6.5.5, the following message is common:
mmc0: cannot verify signal voltage switch
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-05-10 06:23:01 UTC
Are you using vanilla kernel or something that is close to vanilla? 

What does "makes my system frequently fail" exactly mean? Fail to boot?

And could you maybe bisect the issue: https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-05-10 06:23:59 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #1)
> What does "makes my system frequently fail" exactly mean? Fail to boot?

Ignore that, I had missed the "causing the filesystem on the card to stop, after several read writes."
Comment 3 Thomas Haschka 2024-05-10 09:53:52 UTC
I am using the Gentoo patchset on the kernel, further patched with camera drivers for the Surface Go 2.

None of these patches should touch the rtsx_pci driver or the cardreader directory.

I run the root filesystem on the SD card, to be more specific a SanDisk 128GB A2 V30. Therefore the system fails once the SD card stops responding.

The failure is probably rare for people not running a root filesystem on the SD card, as the card reader seems to work fine when booting the system but fails after about 10-15 minutes of activity, with dmesg showing the error message above:

mmc0: error -95 doing runtime resume

and the filesystem not responding anymore. The SD card is ext4 formatted and mounted with journaling turned off.

I will provide kernel configuration files for both 6.5.5 and 6.8.9 later today, in the hope that it helps.

As said, reverting to 6.5.5 works fine, only yielding a

mmc0: cannot verify signal voltage switch

but I have been getting this error for years, every other minute, on this system, also with different SD cards, without any noticeable effect.
Comment 4 Thomas Haschka 2024-09-22 13:32:42 UTC
The problem starts already at kernel 6.6, at least at 6.6.52

I also tried to modify: 

drivers/mmc/core/core.c

essentially reverting it back to its 6.5.5 state:

diff linux-6.5.5-gentoo/drivers/mmc/core/core.c linux-6.6.52-gentoo/drivers/mmc/core/core.c 
554c554,556
<       mmc_wait_for_cmd(host, &cmd, 0);
---
>       mmc_wait_for_cmd(host, &cmd, MMC_CMD_RETRIES);
> 
>       mmc_poll_for_busy(host->card, MMC_CQE_RECOVERY_TIMEOUT, true,
>       MMC_BUSY_IO);
562c564
<       err = mmc_wait_for_cmd(host, &cmd, 0);
---
>       err = mmc_wait_for_cmd(host, &cmd, MMC_CMD_RETRIES);
564a567,569
> 
>       if (err)
>               err = mmc_wait_for_cmd(host, &cmd, MMC_CMD_RETRIES);

Doing this I was semi-successful:

mmc0: error -95 doing runtime resume

still showed up, but after around 1 minute of waiting the driver seemed to have recovered and continued working, until failing again a few minutes later. This causes long stalls when trying to use the card.
Comment 5 Thomas Haschka 2024-09-22 13:33:53 UTC
Created attachment 306908
kernel config with working mmc driver as of kernel 6.5.5
Comment 6 Thomas Haschka 2024-09-22 13:34:30 UTC
Created attachment 306909
Kernel config with non-working mmc driver as of 6.6.52
Comment 7 Thomas Haschka 2024-09-24 12:26:30 UTC
I also tried another card:

With SanDisk 128GB A2 V30: I get mmc0: error -95 doing runtime resume
With   ADATA 256GB A1 V10: I get mmc0: error -84 doing runtime resume

using 6.6.52.

Again, with 6.5.5 everything works.
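
For readers decoding the two numbers: the kernel reports negated errno values, so they can be looked up in the generic Linux errno table, where 84 is EILSEQ and 95 is EOPNOTSUPP. In the MMC core, -EILSEQ is conventionally associated with CRC or data format problems. The sketch below hard-codes only those two values; the helper name decode_mmc_error is made up for illustration and is not part of any kernel tool.

```python
import re

# Values from the generic Linux errno table (asm-generic/errno.h).
# Only the two codes that appear in this report are listed.
LINUX_ERRNO = {
    84: ("EILSEQ", "Illegal byte sequence"),
    95: ("EOPNOTSUPP", "Operation not supported on transport endpoint"),
}


def decode_mmc_error(line):
    """Return (host, errno name, description) for an 'mmcN: error -E ...' line.

    Returns None when the line does not contain such an error.
    """
    match = re.search(r"(mmc\d+): error -(\d+)\b", line)
    if match is None:
        return None
    host, code = match.group(1), int(match.group(2))
    name, text = LINUX_ERRNO.get(code, ("unknown", "not in this table"))
    return host, name, text


print(decode_mmc_error("mmc0: error -95 doing runtime resume"))
print(decode_mmc_error("mmc0: error -84 doing runtime resume"))
```

The two cards therefore fail in different ways during runtime resume, which may be worth keeping in mind when comparing results.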
Comment 8 Bugspray Bot 2024-10-08 15:00:22 UTC
Ulf Hansson <ulf.hansson@linaro.org> replies to comment #2:

+ Ricky

On Sun, 22 Sept 2024 at 15:35, The Linux kernel's regression tracker
(Thorsten Leemhuis) via Bugspray Bot <bugbot@kernel.org> wrote:
>
> The Linux kernel's regression tracker (Thorsten Leemhuis) writes via
> Kernel.org Bugzilla:
>
> (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from
> comment #1)
> > What does "makes my system frequently fail" exactly mean? Fail to boot?
>
> Ignore that, I had missed the "causing the filesystem on the card to stop,
> after several read writes."

Did you try to revert the below commit?

0e4cac557531 misc: rtsx: Fix some platforms can not boot and move the
l1ss judgment to probe

Kind regards
Uffe

(via https://msgid.link/CAPDyKFq4-fL3oHeT9phThWQJqzicKeA447WBJUbtcKPhdZ2d1A@mail.gmail.com)
Comment 9 Thomas Haschka 2024-12-31 09:01:49 UTC
I tried to revert the commit that you pointed out by applying the following patch on 6.12.5. 

The card in question still fails with error -95.

For the moment I am stuck at 6.5.5. 

Thanks for helping out so far, but commit 
0e4cac557531 misc: rtsx: Fix some platforms can not boot and move the
l1ss judgment to probe

does not seem to be the root of the problem. 

All the best, 
Thomas
Comment 10 Thomas Haschka 2024-12-31 09:03:34 UTC
Created attachment 307427
Patch to revert changes from commit 0e4cac557531 [ Does not solve the problem ]
Comment 11 Adrian Hunter 2025-01-02 06:27:00 UTC
I'd suggest trying to bisect as mentioned earlier

https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
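
Conceptually, bisection is a binary search between a known-good and a known-bad kernel. The sketch below models that idea on a simplified linear history. Real kernel history is a graph and git bisect handles that itself, so this is an illustration of the principle only. The function name first_bad, the 1000-commit history and the position of the culprit are all invented for the example.

```python
def first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. is_bad(commit) stands in for building and
    testing a kernel at that commit.
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(commits[mid]):
            bad = mid      # the regression is at mid or earlier
        else:
            good = mid     # the regression is after mid
    return commits[bad], steps


# Hypothetical linear history of 1000 commits where commit 613 introduced the bug.
history = list(range(1000))
culprit, builds = first_bad(history, lambda c: c >= 613)
print(culprit, builds)
```

The number of kernels to build and test grows only logarithmically with the number of commits between the good and the bad version, which is what makes bisecting across several releases practical.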
Comment 12 Thomas Haschka 2025-02-21 13:39:09 UTC
I bisected the problem, and it seems to come from the block layer.

Before this change the SD cards in my Surface Go 2 behave correctly; afterwards they fail after a couple of minutes of usage, especially on read/write-intensive tasks.

The SD card basically stops working correctly after the following commit:

smr /usr/src/linux # git bisect good 
65a558f66c308251e256317957b75d1e643c33c3 is the first bad commit
commit 65a558f66c308251e256317957b75d1e643c33c3
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Fri Jul 21 10:27:30 2023 -0700

    block: Improve performance for BLK_MQ_F_BLOCKING drivers
    
    blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
    has been set. This is suboptimal since running the queue asynchronously
    is slower than running the queue synchronously. This patch modifies
    blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
    - Run the queue synchronously if it is allowed to sleep.
    - Run the queue asynchronously if it is not allowed to sleep.
    Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
    blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
    may be invoked from atomic context.
    
    The following caller chains have been reviewed:
    
    blk_mq_run_hw_queue(hctx, false)
      blk_mq_get_tag()      /* may sleep, hence the functions it calls may also sleep */
      blk_execute_rq()             /* may sleep */
      blk_mq_run_hw_queues(q, async=false)
        blk_freeze_queue_start()   /* may sleep */
        blk_mq_requeue_work()      /* may sleep */
        scsi_kick_queue()
          scsi_requeue_run_queue() /* may sleep */
          scsi_run_host_queues()
            scsi_ioctl_reset()     /* may sleep */
      blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
        blk_mq_dispatch_plug_list(plug, from_sched=false)
          blk_mq_flush_plug_list(plug, from_schedule=false)
            __blk_flush_plug(plug, from_schedule=false)
            blk_add_rq_to_plug()
              blk_mq_submit_bio()  /* may sleep if REQ_NOWAIT has not been set */
      blk_mq_plug_issue_direct()
        blk_mq_flush_plug_list()   /* see above */
      blk_mq_dispatch_plug_list(plug, from_sched=false)
        blk_mq_flush_plug_list()   /* see above */
      blk_mq_try_issue_directly()
        blk_mq_submit_bio()        /* may sleep if REQ_NOWAIT has not been set */
      blk_mq_try_issue_list_directly(hctx, list)
        blk_mq_insert_requests() /* see above */
    
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 block/blk-mq.c          | 16 ++++++++++------
 drivers/scsi/scsi_lib.c |  3 ++-
 2 files changed, 12 insertions(+), 7 deletions(-)
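
Reduced to its decision rule, the commit message above describes the following change for queues that set BLK_MQ_F_BLOCKING. This is a toy model of the description only, not of the code in block/blk-mq.c, and both function names are invented for illustration.

```python
def run_mode_before(caller_may_sleep):
    """Before the commit: a BLK_MQ_F_BLOCKING queue is always run asynchronously."""
    return "async"


def run_mode_after(caller_may_sleep):
    """After the commit: run synchronously whenever the caller is allowed to sleep."""
    return "sync" if caller_may_sleep else "async"


for may_sleep in (True, False):
    print(may_sleep, run_mode_before(may_sleep), run_mode_after(may_sleep))
```

The only behavioural difference in this model is the case where the caller may sleep: the queue is then run directly in the caller's context instead of being deferred, which changes the timing of request dispatch.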
Comment 13 Thomas Haschka 2025-02-21 14:55:37 UTC
Created attachment 307693
Patch for 6.12.16 which solves the problem on the surface go 2

By reverting the bisected commit I was able to build this patch for the current 6.12.16 kernel, which yields stable SD card operation on the Surface Go 2.
Comment 14 Adrian Hunter 2025-03-11 12:51:49 UTC
On 21/02/25 16:55, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218821
> 
> --- Comment #13 from Thomas Haschka (haschka@gmail.com) ---
> Created attachment 307693 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=307693&action=edit
> Patch for 6.12.16 which solves the problem on the surface go 2
> 
> By reversing the bisected commit it was possible to me to build this patch
> for
> the current 6.12.16 kernel which yields stable sd card operation on the 
> surface go 2.

I do not really see how that commit could affect the card, but it could
be that it results in runtime suspend then runtime resume happening very
close together.  If there were insufficient delays to allow voltage levels
to reach the correct values, it could result in the card misbehaving as
seen.
Comment 15 Bugspray Bot 2025-03-12 12:49:41 UTC
Ulf Hansson <ulf.hansson@linaro.org> replies to comment #14:

On Tue, 11 Mar 2025 at 13:54, Adrian Hunter via Bugspray Bot
<bugbot@kernel.org> wrote:
>
> Adrian Hunter writes via Kernel.org Bugzilla:
>
> On 21/02/25 16:55, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=218821
> >
> > --- Comment #13 from Thomas Haschka (haschka@gmail.com) ---
> > Created attachment 307693 [details]
> >   --> https://bugzilla.kernel.org/attachment.cgi?id=307693&action=edit
> > Patch for 6.12.16 which solves the problem on the surface go 2
> >
> > By reversing the bisected commit it was possible to me to build this patch
> > for
> > the current 6.12.16 kernel which yields stable sd card operation on the
> > surface go 2.
>
> I do not really see how that commit could affect the card, but it could
> be that it results in runtime suspend then runtime resume happening very
> close together.  If there were insufficient delays to allow voltage levels
> to reach the correct values, it could result in the card misbehaving as
> seen.

I agree, it shouldn't. Unless, as you say, it somehow causes our
runtime PM callbacks for the SD card (mmc_sd_runtime_suspend() and
mmc_sd_runtime_resume()) to run too frequently.

The runtime PM autosuspend timeout defaults to 3000 ms. The mmc block
layer does its own runtime PM reference counting internally, rather
than relying on the generic block layer to do this for us. Could it be
that our autosuspend timeout somehow gets overridden by the generic
block layer?

Anyway, I have suggested dropping MMC_CAP_AGGRESSIVE_PM from
drivers/mmc/host/rtsx_pci_sdmmc.c, to see if that helps.
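
How a usage counter and an autosuspend delay interact can be illustrated with a toy model. This is a sketch only: the class and method names are invented and merely mimic the general shape of runtime PM reference counting, and the 3000 ms default is the figure quoted above. It does not model the mmc or block layer code.

```python
class RuntimePM:
    """Toy model of a usage counter plus an autosuspend delay (times in ms)."""

    def __init__(self, autosuspend_delay_ms=3000):
        self.delay = autosuspend_delay_ms
        self.usage = 0          # number of active users holding a reference
        self.last_busy = 0      # time of the most recent release
        self.suspended = False
        self.resumes = 0        # how many runtime resumes were needed

    def get(self):
        """Take a reference; resume the device first if it was suspended."""
        if self.suspended:
            self.suspended = False
            self.resumes += 1
        self.usage += 1

    def put_autosuspend(self, now):
        """Drop a reference and mark the device as recently used."""
        self.usage -= 1
        self.last_busy = now

    def tick(self, now):
        """Suspend only when unused and idle for the full delay."""
        idle_for = now - self.last_busy
        if not self.suspended and self.usage == 0 and idle_for >= self.delay:
            self.suspended = True


pm = RuntimePM()
pm.get()
pm.put_autosuspend(now=100)
pm.tick(now=2000)   # idle for 1900 ms: stays active
pm.tick(now=3100)   # idle for 3000 ms: suspends
pm.get()            # the next request needs a runtime resume
print(pm.resumes, pm.suspended)
```

In this model a suspend followed closely by a resume can only happen if a request arrives just after the delay expires, or if the effective delay is shorter than expected, which is the scenario being asked about here.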


Kind regards
Uffe

(via https://msgid.link/CAPDyKFoiYQAM5b%2BAGiebTbSW8GNs1ppjkYbAbs8mt1-kxX2GUA@mail.gmail.com)
Comment 16 Thomas Haschka 2025-03-13 10:59:38 UTC
As Uffe suggested, I tried to remove MMC_CAP_AGGRESSIVE_PM from
drivers/mmc/host/rtsx_pci_sdmmc.c

I add the patch so that you can verify what I did. 

It did however not solve the problem.

All the best, 
- Thomas
Comment 17 Thomas Haschka 2025-03-13 11:01:04 UTC
Created attachment 307814
Patch to remove MMC_CAP_AGGRESSIVE_PM ( does not fix the problem )
Comment 18 Bugspray Bot 2025-03-13 13:39:42 UTC
Ulf Hansson <ulf.hansson@linaro.org> replies to comment #16:

On Thu, 13 Mar 2025 at 12:04, Thomas Haschka via Bugspray Bot
<bugbot@kernel.org> wrote:
>
> Thomas Haschka writes via Kernel.org Bugzilla:
>
> As Uffe suggested i tried to remove MMC_CAP_AGGRESSIVE_PM from
> drivers/mmc/host/rtsx_pci_sdmmc.c
>
> I add the patch so that you can verify what I did.

The patch seems okay to me!

>
> It did however not solve the problem.

That was really surprising to me. So are you still getting the error
"mmc0: error -95 doing runtime resume"? Or something else?

If the same error occurs, I am puzzled. The code path should not be
executed when MMC_CAP_AGGRESSIVE_PM is unset. Perhaps add a few prints
in mmc_sd_runtime_suspend() to make sure what code path we are
running?

pr_err("%s: %s\n", mmc_hostname(host), __func__);

if (!(host->caps & MMC_CAP_AGGRESSIVE_PM))
        return 0;

pr_err("%s: %s - AGGRESSIVE_PM\n", mmc_hostname(host), __func__);

[...]
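
Once such prints are in place, the resulting dmesg output can be summarised with a small script. The sketch below assumes the two messages look exactly as the format strings above would produce them for host mmc0. The sample text is constructed from those format strings and is not real output from the affected machine, and the function name suspend_paths is invented.

```python
def suspend_paths(dmesg_text, func="mmc_sd_runtime_suspend"):
    """Count entries into func and how many got past the capability check.

    Returns (entered, aggressive): lines from the first print and lines
    from the second print, respectively.
    """
    entered = aggressive = 0
    for line in dmesg_text.splitlines():
        if func not in line:
            continue
        if line.rstrip().endswith("- AGGRESSIVE_PM"):
            aggressive += 1
        else:
            entered += 1
    return entered, aggressive


# Constructed sample, derived from the two format strings; not real dmesg output.
sample = """\
[  100.000000] mmc0: mmc_sd_runtime_suspend
[  100.000100] mmc0: mmc_sd_runtime_suspend - AGGRESSIVE_PM
[  200.000000] mmc0: mmc_sd_runtime_suspend
"""
print(suspend_paths(sample))
```

With MMC_CAP_AGGRESSIVE_PM removed, the second count would be expected to stay at zero; a non-zero value would suggest the capability is still set on the running kernel.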

Kind regards
Uffe

(via https://msgid.link/CAPDyKFrfB2W9YBe%2BXR7%3DTv67zivJ4bVt%2BSyuEH2evY%2B4KWN_MA@mail.gmail.com)
