Bug 106331

Summary: MMC times out during installation (Surface 3)
Product: Drivers Reporter: Bastien Nocera (bugzilla)
Component: MMC/SDAssignee: drivers_mmc-sd
Status: NEW ---    
Severity: normal CC: andrea, dan, stanislav.v.v, stephenjust
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.3 rc5 Subsystem:
Regression: No Bisected commit-id:
Attachments: 4.3 rc5 kernel full dmesg with timeouts
dmesg Lenovo Ideapad 120s

Description Bastien Nocera 2015-10-20 09:42:35 UTC
Created attachment 190611 [details]
4.3 rc5 kernel full dmesg with timeouts

On 4.3.0-0.rc4.git3.2.fc24.x86_64 (and later on 4.3 rc5), the MMC would time out and then become unusable during a Fedora installation on a Microsoft Surface 3 (mine is a 4GB/128GB one), sometimes while formatting partitions, sometimes while doing the installation itself (which uses rsync to copy everything from the original USB disk to the internal MMC).

Oct 09 18:09:24 localhost kernel: mmc0: starting CMD25 arg 05ba4800 flags 000000b5
Oct 09 18:09:24 localhost kernel: mmc0:     blksz 512 blocks 1024 flags 00000100 tsac 1200 ms nsac 8000
Oct 09 18:09:24 localhost kernel: mmc0:     CMD12 arg 00000000 flags 0000049d
Oct 09 18:09:24 localhost kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
Oct 09 18:09:34 localhost kernel: mmc0: Timeout waiting for hardware interrupt.
Oct 09 18:09:34 localhost kernel: sdhci: =========== REGISTER DUMP (mmc0)===========
Oct 09 18:09:34 localhost kernel: sdhci: Sys addr: 0x00000400 | Version:  0x00001002
Oct 09 18:09:34 localhost kernel: sdhci: Blk size: 0x00007200 | Blk cnt:  0x00000203
Oct 09 18:09:34 localhost kernel: sdhci: Argument: 0x05ba4800 | Trn mode: 0x0000002b
Oct 09 18:09:34 localhost kernel: sdhci: Present:  0x1fff0106 | Host ctl: 0x00000035
Oct 09 18:09:34 localhost kernel: sdhci: Power:    0x0000000b | Blk gap:  0x00000080
Oct 09 18:09:34 localhost kernel: sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
Oct 09 18:09:34 localhost kernel: sdhci: Timeout:  0x00000008 | Int stat: 0x00000000
Oct 09 18:09:34 localhost kernel: sdhci: Int enab: 0x02ff000b | Sig enab: 0x02ff000b
Oct 09 18:09:34 localhost kernel: sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
Oct 09 18:09:34 localhost kernel: sdhci: Caps:     0x446cc8b2 | Caps_1:   0x00000007
Oct 09 18:09:34 localhost kernel: sdhci: Cmd:      0x0000193a | Max curr: 0x00000000
Oct 09 18:09:34 localhost kernel: sdhci: Host ctl2: 0x0000008b
Oct 09 18:09:34 localhost kernel: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x3fbf90e0
Oct 09 18:09:34 localhost kernel: sdhci: ===========================================
Oct 09 18:09:34 localhost kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
Oct 09 18:09:34 localhost kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
Oct 09 18:09:34 localhost kernel: mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
Oct 09 18:09:34 localhost kernel: mmc0: req done (CMD25): 0: 00000900 00000000 00000000 00000000
Oct 09 18:09:34 localhost kernel: mmc0:     0 bytes transferred: -110
Oct 09 18:09:34 localhost kernel: mmc0:     (CMD12): 0: 00000d00 00000000 00000000 00000000
Oct 09 18:09:34 localhost kernel: mmc0: starting CMD13 arg 00010000 flags 00000195
Oct 09 18:09:34 localhost kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
Oct 09 18:09:34 localhost kernel: mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
Oct 09 18:09:34 localhost kernel: mmc0: starting CMD13 arg 00010000 flags 00000195
Oct 09 18:09:34 localhost kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
Oct 09 18:09:34 localhost kernel: mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
Oct 09 18:09:34 localhost kernel: mmcblk0: error -110 transferring data, sector 96094208, nr 1024, cmd response 0x900, card status 0xd00
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094208
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094216
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094224
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094232
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094240
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094248
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094256
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094264
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094272
Oct 09 18:09:34 localhost kernel: blk_update_request: I/O error, dev mmcblk0, sector 96094280

Trying to raise the default timeout to 10 times that failed to solve the problem, the problem happened just a little bit later.

Reverting those 2 patches solves the timeouts and allows the installation to complete:
commit 66c39dfc92f9d35ed9f713833156547842086891
Author: Adrian Hunter <adrian.hunter <at> intel.com>
Date:   Thu May 7 13:10:21 2015 +0300

    mmc: sdhci: Change to new way of doing re-tuning
    
    Make use of mmc core support for re-tuning instead
    of doing it all in the sdhci driver.
    
    This patch also changes to flag the need for re-tuning
    always after runtime suspend when tuning has been used
    at initialization. Previously it was only done if
    the re-tuning timer was in use.
    
    Signed-off-by: Adrian Hunter <adrian.hunter <at> intel.com>
    Signed-off-by: Ulf Hansson <ulf.hansson <at> linaro.org>

commit b69587e2d5b09a192c45c604ea1f9e8d51f4c3a1
Author: Adrian Hunter <adrian.hunter <at> intel.com>
Date:   Fri Feb 6 14:13:01 2015 +0200

    mmc: sdhci-pci: Enable HS400 for some Intel host controllers
    
    Enable detection of HS400 support via capability bit-63
    for some Intel host controllers.
    
    Signed-off-by: Adrian Hunter <adrian.hunter <at> intel.com>
    Signed-off-by: Ulf Hansson <ulf.hansson <at> linaro.org>

Reverting just one of those 2 commits is not enough to solve the timeout problems.

The discussion is originally from:
http://thread.gmane.org/gmane.linux.kernel.mmc/34277
Comment 1 Dan 2016-10-27 17:04:47 UTC
Hi

We are seeing similar symptoms on another Cherry Trail platform, the UP board [1], which has an Intel Atom X5-Z8350 SoC.

Some details from the kernel log
[    2.291332] mmc0: SDHCI controller on ACPI [80860F14:00] using ADMA
...
[    2.423293] mmc0: new HS200 MMC card at address 0001
[    2.428541] mmcblk0: mmc0:0001 M52564 58.3 GiB 
[    2.428744] mmcblk0boot0: mmc0:0001 M52564 partition 1 4.00 MiB
[    2.428915] mmcblk0boot1: mmc0:0001 M52564 partition 2 4.00 MiB
[    2.429072] mmcblk0rpmb: mmc0:0001 M52564 partition 3 4.00 MiB
[    2.431859]  mmcblk0: p1 p2 p3
...

The eMMC flash part is a 64GB Kingston M525 chip.

We're currently seeing this issue on a v4.4 kernel, and have also reproduced on the v4.8 kernel.  In normal use, it occurs at random, and not frequently.  But if we run a test with intensive disk I/O and CPU/GPU activity, it can be reproduced quite reliably in less than 30 minutes.

The pattern seems to be that we see timeouts during tuning, which occur from time to time and mostly appear harmless.  But, on some occasions, this timeout is followed 10 seconds later by a "Timeout waiting for hardware interrupt".  This is the nasty one, because it typically causes some write operation to fail and can be fatal for the application.

[103887.417739] sdhci: Timeout waiting for Buffer Read Ready interrupt during tuning procedure, falling back to fixed sampling clock
[103897.561325] mmc0: Timeout waiting for hardware interrupt.
[103897.561344] sdhci: =========== REGISTER DUMP (mmc0)===========
[103897.561352] sdhci: Sys addr: 0x00000400 | Version:  0x00001002
[103897.561358] sdhci: Blk size: 0x00007200 | Blk cnt:  0x00000400
[103897.561363] sdhci: Argument: 0x01180800 | Trn mode: 0x0000002b
[103897.561368] sdhci: Present:  0x1fff0001 | Host ctl: 0x00000035
[103897.561374] sdhci: Power:    0x0000000b | Blk gap:  0x00000080
[103897.561379] sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
[103897.561390] sdhci: Timeout:  0x00000008 | Int stat: 0x00000000
[103897.561395] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02ff000b
[103897.561401] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[103897.561406] sdhci: Caps:     0x446cc8b2 | Caps_1:   0x00000807
[103897.561412] sdhci: Cmd:      0x0000193a | Max curr: 0x00000000
[103897.561416] sdhci: Host ctl2: 0x0000000b
[103897.561421] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x36400200
[103897.561424] sdhci: ===========================================
[103897.565662] mmcblk0: error -110 sending stop command, original cmd response 0x0, card status 0x400900
[103897.565704] mmcblk0: error -110 transferring data, sector 18352128, nr 1024, cmd response 0x0, card status 0x0

It doesn't occur in quite the same way as described in the original ticket, so it may be a different issue, but perhaps its related.  It's interesting that it is occurring on a very similar platform.

We've tried various experiments to try to find a workaround/solution on the UP board.  In our case, removing the commits mentioned in the original message didn't help.  Increasing the command timeout (the one that triggers the hardware interrupt timeout error) by 50ms didn't help either.

The only thing that seems to make a real difference is increasing the timeout for the tuning, from 50ms to 100ms.  It perhaps hints that not leaving enough time before cleaning up after the failed tuning can have a knock-on effect on subsequent commands?

Has this issue been observed on any other Cherry Trail platforms?

[1] http://www.up-board.org/up/
Comment 2 Stanislav V. Vlasov 2018-09-10 08:30:39 UTC
Got this on Lenovo Ideapad 120S:
Tried kernels from Debian distribution: 4.9.110, 4.19.0-rc2 with same result.

dmesg:

```
[ 7487.305755] JBD2: Detected IO errors while flushing file data on mmcblk0p3-8
[ 7497.466460] mmc0: Timeout waiting for hardware interrupt.
[ 7497.466475] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[ 7497.466485] mmc0: sdhci: Sys addr:  0x00000030 | Version:  0x00001002
[ 7497.466492] mmc0: sdhci: Blk size:  0x00007200 | Blk cnt:  0x00000030
[ 7497.466500] mmc0: sdhci: Argument:  0x018ce3f0 | Trn mode: 0x0000002b
[ 7497.466507] mmc0: sdhci: Present:   0x1fff0001 | Host ctl: 0x0000003d
[ 7497.466514] mmc0: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
[ 7497.466521] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000007
[ 7497.466529] mmc0: sdhci: Timeout:   0x00000006 | Int stat: 0x00000000
[ 7497.466536] mmc0: sdhci: Int enab:  0x02ff000b | Sig enab: 0x02ff000b
[ 7497.466543] mmc0: sdhci: AC12 err:  0x00000000 | Slot int: 0x00000000
[ 7497.466551] mmc0: sdhci: Caps:      0x546ec881 | Caps_1:   0x00000807
[ 7497.466558] mmc0: sdhci: Cmd:       0x0000193a | Max curr: 0x00000000
[ 7497.466565] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
[ 7497.466572] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00001900
[ 7497.466578] mmc0: sdhci: Host ctl2: 0x0000008b
[ 7497.466587] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x000000017442b200
[ 7497.466591] mmc0: sdhci: ============================================
[ 7497.472074] mmc0: cache flush error -110
[ 7497.472086] print_req_error: I/O error, dev mmcblk0, sector 26010656
[ 7497.472104] Aborting journal on device mmcblk0p3-8.
[ 7497.526410] mmc0: Tuning timeout, falling back to fixed sampling clock
[ 7497.529355] EXT4-fs error (device mmcblk0p3): ext4_journal_check_start:61: Detected aborted journal
[ 7497.529361] EXT4-fs (mmcblk0p3): Remounting filesystem read-only
```

hardware:

```
[    0.804066] pci 0000:00:1c.0: [8086:5acc] type 00 class 0x080501
[    0.804102] pci 0000:00:1c.0: reg 0x10: [mem 0x91327000-0x91327fff 64bit]
[    0.804120] pci 0000:00:1c.0: reg 0x18: [mem 0x91328000-0x91328fff 64bit]

[    2.424201] sdhci: Secure Digital Host Controller Interface driver
[    2.424204] sdhci: Copyright(c) Pierre Ossman
[    2.426056] sdhci-pci 0000:00:1c.0: SDHCI controller found [8086:5acc] (rev b)

[    2.437453] mmc0: SDHCI controller on PCI [0000:00:1c.0] using ADMA 64-bit

[    2.560657] mmc0: new HS200 MMC card at address 0001
[    2.566365] mmcblk0: mmc0:0001 DF4032 29.1 GiB 
[    2.566956] mmcblk0boot0: mmc0:0001 DF4032 partition 1 4.00 MiB
[    2.567539] mmcblk0boot1: mmc0:0001 DF4032 partition 2 4.00 MiB
[    2.568126] mmcblk0rpmb: mmc0:0001 DF4032 partition 3 4.00 MiB, chardev (246:0)
[    2.572339]  mmcblk0: p1 p2 p3 p4
```

full dmesg: https://pastebin.com/7jfaEjkk
Comment 3 Stanislav V. Vlasov 2018-09-10 08:38:15 UTC
Created attachment 278403 [details]
dmesg Lenovo Ideapad 120s

attach full dmesg for 4.19.0-rc2