Bug 28462 - JMicron SD card DMA transfer corrupted
Summary: JMicron SD card DMA transfer corrupted
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: MMC/SD (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_mmc-sd
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-06 22:29 UTC by Mikko Vinni
Modified: 2011-05-30 08:29 UTC (History)
3 users (show)

See Also:
Kernel Version: post 2.6.36
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg from 2.6.36.3 with sdhci.c DEBUG (124.95 KB, text/plain)
2011-02-06 22:29 UTC, Mikko Vinni
Details

Description Mikko Vinni 2011-02-06 22:29:25 UTC
Created attachment 46662 [details]
dmesg from 2.6.36.3 with sdhci.c DEBUG

See http://www.mail-archive.com/linux-mmc@vger.kernel.org/msg05490.html

I have an HP Pavilion dv5 laptop, which has an integrated memory card reader
(labeled with "SD*MS/Pro*MMC*XD) and appearing with lspci as:

0a:00.1 System peripheral: JMicron Technology Corp. SD/MMC Host Controller
0a:00.2 SD Host controller: JMicron Technology Corp. Standard SD Host Controller
0a:00.3 System peripheral: JMicron Technology Corp. MS Host Controller
0a:00.4 System peripheral: JMicron Technology Corp. xD Host Controller

I have mainly used a 1 GB SD card from puremedia, but I have seen the problem with another card as well. Basically, if I try to read image files from the card using the card reader, the images come out corrupted (having the card inside a Canon Ixus camera and reading the images via a USB cable works).

---

Testing shows that also writes to the card are occasionally corrupted. When it happens, usually the transferred file begins ok, but every 4096 bytes the file might change to contain a block from some other part of the file or even binary garbage. Like so:

- the source file 49.txt consists of lines
0049-0001
0049-0002
0049-0003
...
0049-2399
0049-2400
0049-2401

The file is exactly 24010 bytes long.

After transferring that file to the SD card and back (via the card reader) and 
taking a diff, the differences are:

@@ -407,1995 +407,1992 @@
 0049-0407
 0049-0408
 0049-0409
-0049-0410
-0049-0411
-0049-0412
... (all lines in between removed; each line is 10 bytes)
-0049-2399
-0049-2400
-0049-2401
+0049-00049-0001
+0049-0002
+0049-0003
... (all lines in between added)
+0049-0408
+0049-0409
+0049-00049-0001
+0049-0002
+0049-0003
...
+0049-0408
+0049-0409
+0049-00049-0001
+0049-0002
+0049-0003
...
+0049-0407
+0049-0408
+0049-0409
+0049-00049-0001
+0049-0002
+0049-0003
...
+0049-0407
+0049-0408
+0049-0409
+0049-00049-0001
+0049-0002
+0049-0003
...
+0049-0351
+0049-0352
+0049-0353

dmesg from Linux 2.6.36.3 (which fails) attached, bisect results to follow
Comment 1 Mikko Vinni 2011-02-06 22:53:52 UTC
I bisected the problem between v2.6.36...v2.6.36.3 and v2.6.36...v2.6.37 (I couldn't reproduce the problem on plain v2.6.36).

Between v2.6.36...v2.6.36.3:

712fdd3c5b2b9cfa52a01b9fb22a3be9694bf147 is the first bad commit
commit 712fdd3c5b2b9cfa52a01b9fb22a3be9694bf147
Author: Alexey Starikovskiy <astarikovskiy@suse.de>
Date:   Thu Dec 9 17:07:54 2010 -0500

    ACPI: EC: Add another dmi match entry for MSI hardware
    
    commit a5dc4f898c2a0f66e2cefada6c687db82ba2fcbc upstream.
    
    http://bugzilla.kernel.org/show_bug.cgi?id=15418
    
    Signed-off-by: Alexey Starikovskiy <astarikovskiy@suse.de>
    Signed-off-by: Len Brown <len.brown@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index f31291b..7bff18b 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -929,6 +929,9 @@ static struct dmi_system_id __initdata ec_dmi_table[] = {
        ec_flag_msi, "MSI hardware", {
        DMI_MATCH(DMI_CHASSIS_VENDOR, "MICRO-Star")}, NULL},
        {
+       ec_flag_msi, "MSI hardware", {
+       DMI_MATCH(DMI_CHASSIS_VENDOR, "MICRO-STAR")}, NULL},
+       {
        ec_validate_ecdt, "ASUS hardware", {
        DMI_MATCH(DMI_BIOS_VENDOR, "ASUS") }, NULL},
        {},


This commit should *not* cause any effect on this laptop, because according to dmidecode output: Chassis Information -> Manufacturer: Quanta

The text "Detected MSI hardware, enabling workarounds.\n" from ec_flag_msi() in drivers/acpi/ec.c does *not* appear in the logs. Nevertheless, I can revert this single patch from pure v2.6.36.3 and get a working system.

The config in this bisect was based on Arch Linux's 2.6.36 config, meaning most options turned on (as modules when possible). I tried to trim the config down to make the bisect run faster, but that itself made the problem go away without pinpointing the cause.

---

Between v2.6.36...v2.6.37:
72d7c3b33c980843e756681fb4867dc1efd62a76 is the first bad commit
commit 72d7c3b33c980843e756681fb4867dc1efd62a76
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Wed Aug 25 13:39:17 2010 -0700

    x86: Use memblock to replace early_res


This is much more complex than the previous commit, but I doubt that it would be "buggy", due to the previous commit being wonderfully simple yet causing the same problem.

---

The problem still exists in 2.6.38-rc3-CUST-00201-g44f2c5c.
Comment 2 Chris Ball 2011-02-06 23:10:48 UTC
Thanks for the bisect, but it doesn't seem like these commits could really be causing the problem.  Having it disappear with a smaller config is also very suspicious.  Perhaps you should be bisecting the config changes too?  ;-)

Not sure where to go from here, I'm afraid..
Comment 3 Mikko Vinni 2011-02-06 23:22:30 UTC
I tried to bisect the config too, but got lost somewhere while trying to disable last of the remaining staging drivers. I got the impression that it wasn't about any particular config option, but maybe rather something about the resulting memory layout. I might resume that conquest if nothing else helps.

---

In another attempt I added '#define DEBUG' at the top of drivers/mmc/host/sdhci.c and compared the output between basic 2.6.36.3 and 2.6.36.3 with the commit 712fdd3 (comment #1) reverted. The main difference that I can see:

--- dmesg_2.6.36.3-CUSTI-dirty-orig-debug       2011-02-06 15:39:49.066634161
+0200
+++ dmesg_2.6.36.3-CUSTI-dirty-revert-debug-s2ram-fail2 2011-02-07
00:50:33.730000452 +0200

...
@@ -1017,7 +1015,6 @@
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
-sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
@@ -1901,6 +1898,8 @@
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
+sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
+sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
@@ -2183,7 +2182,6 @@
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
-sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
@@ -2207,7 +2205,6 @@
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
-sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
 sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
@@ -2232,7 +2229,6 @@
...

grep didn't find a single 'sdhci [sdhci_irq()]: *** mmc0 got interrupt:
0x00000008' line in the log of the kernel that works properly.

From drivers/mmc/host/sdhci.h:
#define  SDHCI_INT_DMA_END      0x00000008

There is this comment in drivers/mmc/host/sdhci.c (function sdhci_data_irq):

            /*
             * We currently don't do anything fancy with DMA
             * boundaries, but as we can't disable the feature
             * we need to at least restart the transfer.
             */
            if (intmask & SDHCI_INT_DMA_END)
                    sdhci_writel(host, sdhci_readl(host, SDHCI_DMA_ADDRESS),
                            SDHCI_DMA_ADDRESS);
            if (intmask & SDHCI_INT_DATA_END) {
                    ...

Is there any chance that the handling of SDHCI_INT_DMA_END would not work
properly?
Comment 4 Chris Ball 2011-02-07 00:01:21 UTC
It's possible, but we'd expect such a bug to be bisectable.

I wonder if only bisecting changes inside drivers/mmc/ would get you a more reliable bisect run..
Comment 5 Mikko Vinni 2011-02-13 17:04:57 UTC
They looked reliable. Retesting around 72d7c3b33c980843e756681fb4867dc1efd62a76 (the second suspect) pointed to the same merged branch, unsetting CONFIG_NO_BOOTMEM moved it beyond 72d7c3 but in a couple of commits after that CONFIG_NO_BOOTMEM=y couldn't be avoided anymore. Anyway, onto some test results.

Apologies for a long comment. With these debug printk's (in
2.6.38-rc4-CUST-00145-g3c6c0d6):

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 9e15f41..8a54c50 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -730,6 +730,9 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_data *data)
 
		if (unlikely(broken)) {
			for_each_sg(data->sg, sg, data->sg_len, i) {
+				printk(KERN_DEBUG "sg:: pl=%lx off=%x len=%x dma=%lx\n",
+					sg->page_link, sg->offset,
+					sg->length, (unsigned long) sg->dma_address);
				if (sg->offset & 0x3) {
					DBG("Reverting to PIO because of "
						"bad alignment\n");
@@ -771,6 +774,9 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_data *data)
				host->flags &= ~SDHCI_REQ_USE_DMA;
			} else {
				WARN_ON(sg_cnt != 1);
+				printk(KERN_DEBUG "%s>> DMA_ADDRESS %08x\n",
+					mmc_hostname(host->mmc),
+				      (unsigned int) sg_dma_address(data->sg));
				sdhci_writel(host, sg_dma_address(data->sg),
					SDHCI_DMA_ADDRESS);
			}
@@ -1545,9 +1552,12 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
		 * boundaries, but as we can't disable the feature
		 * we need to at least restart the transfer.
		 */
-		if (intmask & SDHCI_INT_DMA_END)
-			sdhci_writel(host, sdhci_readl(host, SDHCI_DMA_ADDRESS),
-				SDHCI_DMA_ADDRESS);
+		if (intmask & SDHCI_INT_DMA_END) {
+			u32 old = sdhci_readl(host, SDHCI_DMA_ADDRESS);
+			printk(KERN_DEBUG "%s>>DMA_END DMA_ADDRESS %08x\n",
+			      mmc_hostname(host->mmc), old);
+			sdhci_writel(host, old, SDHCI_DMA_ADDRESS);
+		}
 
		if (intmask & SDHCI_INT_DATA_END) {
			if (host->cmd) {

---
The output is like this:
...
Feb 13 12:53:01 koni kernel: sg:: pl=ffffea0003808882 off=0 len=10000 dma=0
Feb 13 12:53:01 koni kernel: mmc0>> DMA_ADDRESS bbd6c000
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
Feb 13 12:53:01 koni kernel: mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
Feb 13 12:53:01 koni kernel: mmc0:     65536 bytes transferred: 0
Feb 13 12:53:01 koni kernel: mmc0:     (CMD12): 0: 00000b00 00000000 00000000 00000000
Feb 13 12:53:01 koni kernel: mmc0: starting CMD18 arg 002c0000 flags 000000b5
Feb 13 12:53:01 koni kernel: mmc0:     blksz 512 blocks 128 flags 00000200 tsac 100 ms nsac 0
Feb 13 12:53:01 koni kernel: mmc0:     CMD12 arg 00000000 flags 0000049d
Feb 13 12:53:01 koni kernel: sg:: pl=ffffea0003808882 off=0 len=10000 dma=0
Feb 13 12:53:01 koni kernel: mmc0>> DMA_ADDRESS bbd7c000
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
Feb 13 12:53:01 koni kernel: mmc0>>DMA_END DMA_ADDRESS bbd7c000
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
Feb 13 12:53:01 koni kernel: mmc0>>DMA_END DMA_ADDRESS bbd7c000
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
Feb 13 12:53:01 koni kernel: mmc0>>DMA_END DMA_ADDRESS bbd7c000
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
Feb 13 12:53:01 koni kernel: mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
Feb 13 12:53:01 koni kernel: mmc0:     65536 bytes transferred: 0
Feb 13 12:53:01 koni kernel: mmc0:     (CMD12): 0: 00000b00 00000000 00000000 00000000
Feb 13 12:53:01 koni kernel: mmc0: starting CMD18 arg 002d0000 flags 000000b5
Feb 13 12:53:01 koni kernel: mmc0:     blksz 512 blocks 128 flags 00000200 tsac 100 ms nsac 0
Feb 13 12:53:01 koni kernel: mmc0:     CMD12 arg 00000000 flags 0000049d
Feb 13 12:53:01 koni kernel: sg:: pl=ffffea0003808882 off=0 len=10000 dma=0
Feb 13 12:53:01 koni kernel: mmc0>> DMA_ADDRESS bbd8c000
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
Feb 13 12:53:01 koni kernel: sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
Feb 13 12:53:01 koni kernel: mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
Feb 13 12:53:01 koni kernel: mmc0:     65536 bytes transferred: 0
Feb 13 12:53:01 koni kernel: mmc0:     (CMD12): 0: 00000b00 00000000 00000000 00000000
...

Based on reading the code and spec ("SD Host Controller Simplified
Specification Version 2.00 February 8, 2007"), and seeing the lines

Feb 13 12:53:01 koni kernel: mmc0>> DMA_ADDRESS bbd7c000
Feb 13 12:53:01 koni kernel: mmc0>>DMA_END DMA_ADDRESS bbd7c000 (3 times)

in the log, I understand/assume:
1) Host SDMA Buffer Boundary is set to 512 KB
2) The Host Controller (HW) generates the DMA interrupt when the transfer
   would cross the boundary (address ends with either 80000 or 00000 in hex)
   0xbbd7c000 + 65536 = 0xbbd8c000
   0xbbd80000 - 0xbbd7c000 = 16384; 65536/16384=4 (three interrupts)
3) spec: "After SDMA has stopped, the next system address of the next
   contiguous data position can be read from this register."
4) With this HW, while handling the dma interrupt,
   sdhci_readl(host, SDHCI_DMA_ADDRESS) returns the original address
	  --> rule or exception (I don't have other SD readers to compare)?

If the bug really originates from 4), then the earlier bisections pointed
to commits that happen to make the DMA addresses more likely to approach
the 512KB boundaries from below. Even in the log snippet above the dma
interrupts don't trigger with every 65536 byte transfer because the addresses
are nicer.

Another code change to test the theory:

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 9e15f41..8a54c50 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -808,7 +814,8 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_data *data)
	sdhci_set_transfer_irqs(host);
 
	/* We do not handle DMA boundaries, so set it to max (512 KiB) */
-	sdhci_writew(host, SDHCI_MAKE_BLKSZ(7, data->blksz), SDHCI_BLOCK_SIZE);
+	/* --> 4KB buffer boundary for testing SDHCI_INT_DMA_END */
+	sdhci_writew(host, SDHCI_MAKE_BLKSZ(0, data->blksz), SDHCI_BLOCK_SIZE);
	sdhci_writew(host, data->blocks, SDHCI_BLOCK_COUNT);
 }
 
With this change I can make (at least) one of the mmc_test tests fail:
...
mmc0: Starting tests of card mmc0:1234...
mmc0: Test case 6. Multi-block read...
mmc0: starting CMD16 arg 00000200 flags 00000095
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
mmc0: req done (CMD16): 0: 00000900 00000000 00000000 00000000
mmc0: starting CMD24 arg 00000000 flags 00000035
mmc0:     blksz 512 blocks 1 flags 00000100 tsac 300 ms nsac 0
sg:: pl=ffffea0003e26202 off=0 len=200 dma=0
mmc0>> DMA_ADDRESS bb580800
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
mmc0: req done (CMD24): 0: 00000900 00000000 00000000 00000000
mmc0:     512 bytes transferred: 0
...
mmc0: req done (CMD24): 0: 00000900 00000000 00000000 00000000
mmc0:     512 bytes transferred: 0
mmc0: starting CMD13 arg 12340000 flags 00000015
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
mmc0: starting CMD16 arg 00000200 flags 00000095
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
mmc0: req done (CMD16): 0: 00000900 00000000 00000000 00000000
mmc0: starting CMD18 arg 00000000 flags 00000035
mmc0:     blksz 512 blocks 16 flags 00000200 tsac 100 ms nsac 0
mmc0:     CMD12 arg 00000000 flags 0000001d
sg:: pl=ffffea0003e26202 off=0 len=2000 dma=0
mmc0>> DMA_ADDRESS bb591000
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
mmc0>>DMA_END DMA_ADDRESS bb591000
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
mmc0:     8192 bytes transferred: 0
mmc0:     (CMD12): 0: 00000b00 00000000 00000000 00000000
mmc0: starting CMD13 arg 12340000 flags 00000015
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
mmc0: Result: FAILED
mmc0: starting CMD16 arg 00000200 flags 00000095
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
mmc0: req done (CMD16): 0: 00000900 00000000 00000000 00000000
...

[root@koni mmc0:1234]# cat test
Test 6: 1

With this modification the file system on the card is unusable (it was
tested *before* using mmc_test):
...
mmc0>>DMA_END DMA_ADDRESS bb4fc000
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
mmc0>>DMA_END DMA_ADDRESS bb4fc000
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
mmc0>>DMA_END DMA_ADDRESS bb4fc000
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000003
mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
mmc0:     65536 bytes transferred: 0
mmc0:     (CMD12): 0: 00000b00 00000000 00000000 00000000
FAT: Filesystem error (dev mmcblk0p1)
    fat_get_cluster: invalid cluster chain (i_pos 0)
FAT: Filesystem has been set read-only
...

mikko ~ $  LANG=C ls /media/CANON_DC/
ls: cannot access /media/CANON_DC/DCIM: Input/output error
ls: cannot access /media/CANON_DC/tmpp: Input/output error
DCIM  tmpp


If it is the hardware that is at fault here, and not all the
card readers behave the same, how can one fix this?

I suppose it's possible to calculate the next address from the
old dma address, but that will fail if some hardware actually
updates the address itself.
Comment 6 Mikko Vinni 2011-02-21 20:46:10 UTC
The patch below on top of 2.6.37.1, 2.6.36.4, and
v2.6.38-rc5-100-g0cc9d52 makes the DMA transfers work reliably.
Tested by viewing some photos from the SD card and by writing
and reading an 836MB iso file.

The original code (commit 6ba736a10e) was merged for v2.6.22-rc2,
and that part of the code hasn't changed since.
Unless some other code change has changed the behaviour of
this particular hardware in the mean time, I don't think
this bug is a regression. Marking as non-regression for now.


Does this patch make sense? Should I send it to the mailing list?

The idea of the patch is that as long as each transfer is at most
512KB (on this reader max seems to be 65536 bytes) each of them
will cross the boundary at most once, and thus we don't need to
save any intermediate values anywhere. Only need to check if
the HW updates the address itself or if we need to kick it a
little bit.

The added printout looks like this:

sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000008
sdhci [sdhci_data_irq()]: mmc0: next DMA address forced from be1fc000 to be200000
(hmm, maybe 0x prefix for the hex numbers would look nice)


diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index a25db42..469cf74 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -1537,9 +1537,27 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask)
 		 * boundaries, but as we can't disable the feature
 		 * we need to at least restart the transfer.
 		 */
-		if (intmask & SDHCI_INT_DMA_END)
-			sdhci_writel(host, sdhci_readl(host, SDHCI_DMA_ADDRESS),
-				SDHCI_DMA_ADDRESS);
+		if (intmask & SDHCI_INT_DMA_END) {
+			u32 dmastart, dmanow;
+			dmastart = sg_dma_address(host->data->sg);
+			dmanow = sdhci_readl(host, SDHCI_DMA_ADDRESS);
+			if (dmanow == dmastart) {
+				/*
+				 * HW failed to increase the address.
+				 * Update to the next 512KB block boundary.
+				 */
+				dmanow = (dmanow & ~0x7ffff) + 0x80000;
+				if (dmanow > dmastart + host->data->blksz *
+							host->data->blocks) {
+					WARN_ON(1);
+					dmanow = dmastart;
+				}
+				DBG("%s: next DMA address forced "
+				    "from %08x to %08x\n",
+				    mmc_hostname(host->mmc), dmastart, dmanow);
+			}
+			sdhci_writel(host, dmanow, SDHCI_DMA_ADDRESS);
+		}
 
 		if (intmask & SDHCI_INT_DATA_END) {
 			if (host->cmd) {
Comment 7 Chris Ball 2011-02-21 20:47:55 UTC
Yes, please do send the patch to the list for review.  Thanks!
Comment 8 Florian Mickler 2011-05-30 07:58:03 UTC
A patch referencing this bug report has been merged in v3.0-rc1:

commit f6a03cbf43e586211f8ea088148c8ecd3fc4b5be
Author: Mikko Vinni <mmvinni@yahoo.com>
Date:   Tue Apr 12 09:36:18 2011 -0400

    mmc: sdhci: work around broken dma boundary behavior

Note You need to log in before you can comment on or make changes to this bug.