Updating from 6.5.5 to 6.8.9 makes my system fail frequently, with the following error showing up in dmesg.

On 6.8.9:
mmc0: error -95 doing runtime resume

This causes the filesystem on the card to stop responding after several reads and writes. Reverting to 6.5.5 solves the issue.

This seems to be due to modifications in drivers/misc/cardreader, the RTSX_PCI driver. The Microsoft Surface Go 2 card reader is detected as RTS522A.

Even on 6.5.5, the following message is common:
mmc0: cannot verify signal voltage switch
Are you using a vanilla kernel or something that is close to vanilla? What exactly does "makes my system frequently fail" mean? Fail to boot? And could you maybe bisect the issue: https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #1) > What does "makes my system frequently fail" exactly mean? Fail to boot? Ignore that, I had missed the "causing the filesystem on the card to stop, after several read writes."
I am using the Gentoo patchset on the kernel, further patched with camera drivers for the Surface Go 2. None of these patches should touch the rtsx_pci driver or the cardreader directory.

I run the root filesystem on the SD card, specifically a SanDisk 128GB A2 V30, so the system fails once the SD card stops responding. The failure is probably rare for people not running a root filesystem on the SD card, as the card reader seems to work fine at first: the system boots, but the reader fails after about 10-15 minutes of activity, with dmesg showing the error message above:

mmc0: error -95 doing runtime resume

and the filesystem no longer responding. The SD card is ext4 formatted and mounted with journaling turned off.

I will provide kernel configuration files for both 6.5.5 and 6.8.9 later today, in the hope that it helps.

As said, reverting to 6.5.5 works fine, only yielding

mmc0: cannot verify signal voltage switch

but I have been getting this message for years, every other minute, on this system, also with different SD cards, without any noticeable effect.
The problem already starts with kernel 6.6, at least as of 6.6.52.

I also tried modifying drivers/mmc/core/core.c, essentially reverting it to its 6.5.5 state:

diff linux-6.5.5-gentoo/drivers/mmc/core/core.c linux-6.6.52-gentoo/drivers/mmc/core/core.c
554c554,556
<       mmc_wait_for_cmd(host, &cmd, 0);
---
>       mmc_wait_for_cmd(host, &cmd, MMC_CMD_RETRIES);
>
>       mmc_poll_for_busy(host->card, MMC_CQE_RECOVERY_TIMEOUT, true,
>                         MMC_BUSY_IO);
562c564
<       err = mmc_wait_for_cmd(host, &cmd, 0);
---
>       err = mmc_wait_for_cmd(host, &cmd, MMC_CMD_RETRIES);
564a567,569
>
>       if (err)
>               err = mmc_wait_for_cmd(host, &cmd, MMC_CMD_RETRIES);

Doing this I was only partly successful:

mmc0: error -95 doing runtime resume

still showed up, but after around 1 minute of waiting the driver seemed to have recovered and continued working, until failing again a few minutes later. This causes long stalls when trying to use the card.
Created attachment 306908 [details] kernel config with working mmc driver as of kernel 6.5.5
Created attachment 306909 [details] Kernel config with non-working mmc driver as of 6.6.52
I also tried another card, using 6.6.52:

With the SanDisk 128GB A2 V30 I get
mmc0: error -95 doing runtime resume

With an ADATA 256GB A1 V10 I get
mmc0: error -84 doing runtime resume

Again, with 6.5.5 everything works.
Ulf Hansson <ulf.hansson@linaro.org> replies to comment #2:

+ Ricky

On Sun, 22 Sept 2024 at 15:35, The Linux kernel's regression tracker (Thorsten Leemhuis) via Bugspray Bot <bugbot@kernel.org> wrote:
>
> The Linux kernel's regression tracker (Thorsten Leemhuis) writes via
> Kernel.org Bugzilla:
>
> (In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from
> comment #1)
> > What does "makes my system frequently fail" exactly mean? Fail to boot?
>
> Ignore that, I had missed the "causing the filesystem on the card to stop,
> after several read writes."
>
> View: https://bugzilla.kernel.org/show_bug.cgi?id=218821#c2
> You can reply to this message to join the discussion.

Did you try to revert the below commit?

0e4cac557531 misc: rtsx: Fix some platforms can not boot and move the l1ss judgment to probe

Kind regards
Uffe

(via https://msgid.link/CAPDyKFq4-fL3oHeT9phThWQJqzicKeA447WBJUbtcKPhdZ2d1A@mail.gmail.com)
I tried to revert the commit that you pointed out by applying the attached patch on 6.12.5. The card in question still fails with error -95. For the moment I am stuck at 6.5.5.

Thanks for helping out so far, but commit

0e4cac557531 misc: rtsx: Fix some platforms can not boot and move the l1ss judgment to probe

does not seem to be the root of the problem.

All the best,
Thomas
Created attachment 307427 [details] Patch to revert changes from commit 0e4cac557531 [ Does not solve the problem ]
I'd suggest trying to bisect as mentioned earlier https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
I bisected the problem, and it seems to come from the block layer. Before this change the SD cards in my Surface Go 2 behave correctly; afterwards they fail after a couple of minutes of use, especially during read/write-intensive tasks on the card. The SD card basically stops working correctly after the following commit:

smr /usr/src/linux # git bisect good
65a558f66c308251e256317957b75d1e643c33c3 is the first bad commit
commit 65a558f66c308251e256317957b75d1e643c33c3
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Fri Jul 21 10:27:30 2023 -0700

    block: Improve performance for BLK_MQ_F_BLOCKING drivers

    blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING has
    been set. This is suboptimal since running the queue asynchronously is
    slower than running the queue synchronously. This patch modifies
    blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
    - Run the queue synchronously if it is allowed to sleep.
    - Run the queue asynchronously if it is not allowed to sleep.
    Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
    blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
    may be invoked from atomic context.

    The following caller chains have been reviewed:
    blk_mq_run_hw_queue(hctx, false)
      blk_mq_get_tag()         /* may sleep, hence the functions it calls may also sleep */
      blk_execute_rq()             /* may sleep */
      blk_mq_run_hw_queues(q, async=false)
        blk_freeze_queue_start()   /* may sleep */
        blk_mq_requeue_work()      /* may sleep */
        scsi_kick_queue()
          scsi_requeue_run_queue() /* may sleep */
          scsi_run_host_queues()
            scsi_ioctl_reset()     /* may sleep */
      blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
        blk_mq_dispatch_plug_list(plug, from_sched=false)
          blk_mq_flush_plug_list(plug, from_schedule=false)
            __blk_flush_plug(plug, from_schedule=false)
            blk_add_rq_to_plug()
              blk_mq_submit_bio()  /* may sleep if REQ_NOWAIT has not been set */
      blk_mq_plug_issue_direct()
        blk_mq_flush_plug_list()   /* see above */
      blk_mq_dispatch_plug_list(plug, from_sched=false)
        blk_mq_flush_plug_list()   /* see above */
      blk_mq_try_issue_directly()
        blk_mq_submit_bio()        /* may sleep if REQ_NOWAIT has not been set */
      blk_mq_try_issue_list_directly(hctx, list)
        blk_mq_insert_requests()   /* see above */

    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 block/blk-mq.c          | 16 ++++++++++------
 drivers/scsi/scsi_lib.c |  3 ++-
 2 files changed, 12 insertions(+), 7 deletions(-)
Created attachment 307693 [details] Patch for 6.12.16 which solves the problem on the Surface Go 2

By reverting the bisected commit I was able to build this patch for the current 6.12.16 kernel, which yields stable SD card operation on the Surface Go 2.
On 21/02/25 16:55, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218821
>
> --- Comment #13 from Thomas Haschka (haschka@gmail.com) ---
> Created attachment 307693 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=307693&action=edit
> Patch for 6.12.16 which solves the problem on the surface go 2
>
> By reversing the bisected commit it was possible to me to build this patch
> for
> the current 6.12.16 kernel which yields stable sd card operation on the
> surface go 2.

I do not really see how that commit could affect the card, but it could be that it results in runtime suspend then runtime resume happening very close together. If there were insufficient delays to allow voltage levels to reach the correct values, it could result in the card misbehaving as seen.
Ulf Hansson <ulf.hansson@linaro.org> replies to comment #14:

On Tue, 11 Mar 2025 at 13:54, Adrian Hunter via Bugspray Bot <bugbot@kernel.org> wrote:
>
> Adrian Hunter writes via Kernel.org Bugzilla:
>
> On 21/02/25 16:55, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=218821
> >
> > --- Comment #13 from Thomas Haschka (haschka@gmail.com) ---
> > Created attachment 307693 [details]
> >   --> https://bugzilla.kernel.org/attachment.cgi?id=307693&action=edit
> > Patch for 6.12.16 which solves the problem on the surface go 2
> >
> > By reversing the bisected commit it was possible to me to build this patch
> > for
> > the current 6.12.16 kernel which yields stable sd card operation on the
> > surface go 2.
>
> I do not really see how that commit could affect the card, but it could
> be that it results in runtime suspend then runtime resume happening very
> close together. If there were insufficient delays to allow voltage levels
> to reach the correct values, it could result in the card misbehaving as
> seen.

I agree, it shouldn't. Unless, as you say, it somehow triggers our runtime PM callbacks for the SD card (mmc_sd_runtime_suspend() and mmc_sd_runtime_resume()) to trigger too frequently.

We have the runtime PM autosuspend timeout set default to 3000 ms. We are internally in mmc block layer reference counting runtime PM, rather than relying on the block layer to do this for us. Could it be that our autosuspend timeout gets overridden from the generic block layer, somehow?

Anyway, I have suggested dropping MMC_CAP_AGGRESSIVE_PM from drivers/mmc/host/rtsx_pci_sdmmc.c, to see if that helps.

>
> View: https://bugzilla.kernel.org/show_bug.cgi?id=218821#c14
> You can reply to this message to join the discussion.
> --
> Deet-doot-dot, I am a bot.
> Kernel.org Bugzilla (bugspray 0.1-dev)
>

Kind regards
Uffe

(via https://msgid.link/CAPDyKFoiYQAM5b%2BAGiebTbSW8GNs1ppjkYbAbs8mt1-kxX2GUA@mail.gmail.com)
As Uffe suggested, I tried removing MMC_CAP_AGGRESSIVE_PM from drivers/mmc/host/rtsx_pci_sdmmc.c.

I have attached the patch so that you can verify what I did.

However, it did not solve the problem.

All the best,
- Thomas
Created attachment 307814 [details] Patch to remove MMC_CAP_AGGRESSIVE_PM ( does not fix the problem )
Ulf Hansson <ulf.hansson@linaro.org> replies to comment #16:

On Thu, 13 Mar 2025 at 12:04, Thomas Haschka via Bugspray Bot <bugbot@kernel.org> wrote:
>
> Thomas Haschka writes via Kernel.org Bugzilla:
>
> As Uffe suggested i tried to remove MMC_CAP_AGGRESSIVE_PM from
> drivers/mmc/host/rtsx_pci_sdmmc.c
>
> I add the patch so that you can verify what I did.

The patch seems okay to me!

>
> It did however not solve the problem.

That was really surprising to me. So are you still getting the error "mmc0: error -95 doing runtime resume"? Or something else?

If the same error occurs, I am puzzled. The code path should not be executed when MMC_CAP_AGGRESSIVE_PM is unset. Perhaps add a few prints in mmc_sd_runtime_suspend() to make sure what code path we are running?

pr_err("%s: %s\n", mmc_hostname(host), __func__);

if (!(host->caps & MMC_CAP_AGGRESSIVE_PM))
        return 0;

pr_err("%s: %s - AGGRESSIVE_PM\n", mmc_hostname(host), __func__);

[...]

Kind regards
Uffe

(via https://msgid.link/CAPDyKFrfB2W9YBe%2BXR7%3DTv67zivJ4bVt%2BSyuEH2evY%2B4KWN_MA@mail.gmail.com)