Bug 42679

Summary: DMA Read on Marvell 88SE9128 fails when Intel's IOMMU is on
Product: IO/Storage Reporter: Paweł Żak (pawel.zaq)
Component: Serial ATA Assignee: Jeff Garzik (jgarzik)
Status: RESOLVED CODE_FIX    
Severity: normal CC: acooks, aklhfex, alan, alex.williamson, biergaizi2009, bill.hudacek, bjorn, bugs, cJ-kernel, cjtuckerjr, dev, elliott, f.bluethner, forestix, frollic, grythumn, is, javid.kayvan+kernel, john, k8wtaylnuuz7, kernel, kernel, kevosev23194, linux, listenmitglied, lizhenhua, LK7S2ED64JHGLKj75shg9klejHWG49h5hk, lk, lt-83, mart.bogdan, michael, microsoftenator, mspeder, oh-itsme, oliver.kahrmann, pawel.zaq, public, qemu, r1ch4rd.thompson, sam, stijn+bugs, szg00000, tasos, Theoretically.x64, tom, tradofox, yourpadremb, zhen-hual
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.2.1 Subsystem:
Regression: No Bisected commit-id:
Attachments: Output of `dmesg' command
Output of `lspci -knnv' command
Kernel config
config file / kernel 3.2.2
Output of `lspci -knnv' command
kernel config
dmesg intel z68, asus rampage III gene, vt-d enable
lspci, asus rampage III gene, z68, vt-d enable, 3.2.13
lspci output including device capabilities
dmesg on Asus P9X79 WS, kernel 3.7.3
lspci -knvv on Asus P9X79 WS, kernel 3.7.3
Patch with quirk for incorrect PCI requester IDs
Patch with quirk for incorrect PCIe requester IDs.
Logs and assorted information with respect to IOMMU issues
lspcis for Gigabyte GA-X79-UP4 Rev. 1.1
lspci and logs from tycho (asus z9pe-d8 ws, X79 + C600 chipset) running 3.15-rc3
More Verbose DMESG output for tycho (z9pe-d8 WS, x79 + C600)
Comparison of kernel 3.10.25 and 3.15.0 w/ iommu=pt without quirks entries
without and with patch on Gigabyte GA-X79-UP4 Marvell 88SE9172
Trouble passing through my Highpoint RocketRAID 640L PCIE Storage controller to a domU in my XenCenter home server.
dmesg intremap=nosid
dmesg intremap=off
dmesg intel_iommu=off
dmesg intel_iommu=on
dmesg pci=nomsi
dmesg linux 3.17
dmesg of 4.0.5 vanilla kernel with iommu=on
dmesg of 4.0.5 patched kernel with iommu=on

Description Paweł Żak 2012-01-28 17:55:38 UTC
Created attachment 72217 [details]
Output of `dmesg' command

I have an MSI Z68A-GD80 B3 motherboard. When I enable Intel's IOMMU (kernel booted with intel_iommu=on), the integrated Marvell 88SE9128 SATA controller doesn't work.

To reproduce:
1. Compile and prepare kernel with Intel IOMMU support enabled (CONFIG_INTEL_IOMMU=y).
2. Reboot the computer.
3. Enter BIOS and enable VT-d.
4. Boot the kernel with intel_iommu=on parameter.

Right after boot, kernel reports the following errors (SATA controller is at 0b:00.0):

[    2.639774] DRHD: handling fault status reg 3
[    2.639782] DMAR:[DMA Read] Request device [0b:00.1] fault addr fff00000 
[    2.639783] DMAR:[fault reason 02] Present bit in context entry is clear

After a while these entries appear:

[    7.625837] ata14.00: qc timeout (cmd 0xa1)
[    7.628341] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[    7.935483] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   17.908407] ata14.00: qc timeout (cmd 0xa1)
[   17.910935] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   17.912276] ata14: limiting SATA link speed to 1.5 Gbps
[   18.219077] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   48.134607] ata14.00: qc timeout (cmd 0xa1)
[   48.137508] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   48.444646] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

When a disk is connected to the controller, the disk does not work. When none is connected, the computer starts normally, apart from a long delay that is presumably caused by probing the device.

Since this is the secondary controller on these motherboards, you can avoid the symptoms by plugging the disk into one of the available ports of the built-in Intel SATA controller and disabling the Marvell controller in the BIOS. The other workaround, if you need the eSATA capabilities of the Marvell controller, is to disable VT-d in the BIOS.
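
For anyone comparing logs, the requester and fault address can be pulled out of the DMAR fault lines mechanically. The following is only a minimal sketch: the regular expression and the parse_fault() helper are illustrative names, not part of any kernel tool, and the pattern assumes the line format shown in the dmesg excerpt above.

```python
import re

# Matches lines such as:
#   DMAR:[DMA Read] Request device [0b:00.1] fault addr fff00000
# (pattern is an assumption based on the dmesg excerpt in this report)
FAULT_RE = re.compile(
    r"DMAR:\s*\[(?P<kind>DMA (?:Read|Write))\] Request device "
    r"\[(?P<bus>[0-9a-f]{2}):(?P<dev>[0-9a-f]{2})\.(?P<fn>[0-7])\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)

def parse_fault(line):
    """Return the fields of a DMAR fault line, or None if it does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m.group("kind"),
        "bus": int(m.group("bus"), 16),
        "dev": int(m.group("dev"), 16),
        "fn": int(m.group("fn")),
        "addr": int(m.group("addr"), 16),
    }

line = "[    2.639782] DMAR:[DMA Read] Request device [0b:00.1] fault addr fff00000 "
fault = parse_fault(line)
print(fault)
```

Here the requester is function 1 of device 0b:00, while the SATA controller itself is at 0b:00.0.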
Comment 1 Paweł Żak 2012-01-28 17:56:30 UTC
Created attachment 72218 [details]
Output of `lspci -knnv' command
Comment 2 Paweł Żak 2012-01-28 17:58:19 UTC
Created attachment 72219 [details]
Kernel config
Comment 3 Korneliusz Jarzębski 2012-02-16 23:23:18 UTC
The same problem occurs on an MSI Z68A-GD65 (G3) system with a Marvell 88SE91xx.

grep DMAR:

ACPI: DMAR beaff508 000B0 (v01 ALASKA    A M I 00000001 INTL 00000001)
DMAR: Host address width 36
DMAR: DRHD base: 0x000000fed91000 flags: 0x1
DMAR: RMRR base: 0x000000bf4cc000 end: 0x000000bf4eefff
DMAR: No ATSR found
DMAR:[DMA Read] Request device [03:00.1] fault addr fffc0000 
DMAR:[fault reason 02] Present bit in context entry is clear

grep IOMMU:

Intel-IOMMU: enabled
IOMMU 0: reg_base_addr fed91000 ver 1:0 cap c9008020660262 ecap f0105a
IOMMU 0 0xfed91000: using Queued invalidation
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device 0000:00:1d.0 [0xbf4cc000 - 0xbf4eefff]
IOMMU: Setting identity map for device 0000:00:1a.0 [0xbf4cc000 - 0xbf4eefff]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
Intel-IOMMU: enabled
IOMMU 0: reg_base_addr fed91000 ver 1:0 cap c9008020660262 ecap f0105a
IOMMU 0 0xfed91000: using Queued invalidation
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device 0000:00:1d.0 [0xbf4cc000 - 0xbf4eefff]
IOMMU: Setting identity map for device 0000:00:1a.0 [0xbf4cc000 - 0xbf4eefff]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]

grep ata8:

ata8: SATA max UDMA/133 abar m2048@0xfa310000 port 0xfa310180 irq 48
ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata8.00: qc timeout (cmd 0xec)
ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata8.00: qc timeout (cmd 0xec)
ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata8: limiting SATA link speed to 3.0 Gbps
ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
ata8.00: qc timeout (cmd 0xec)
ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Comment 4 Korneliusz Jarzębski 2012-02-16 23:24:40 UTC
Created attachment 72419 [details]
config file / kernel 3.2.2

Kernel config
Comment 5 Korneliusz Jarzębski 2012-02-16 23:27:05 UTC
Created attachment 72420 [details]
Output of `lspci -knnv' command

Output of `lspci -knnv' command
Comment 6 Daniel Mayer 2012-03-26 12:21:35 UTC
I confirm this bug with kernel 3.2.6: same error with VT-d enabled in bios.

With mainboard "Asus Rampage III Gene", Z68, onboard Marvell; CPU Xeon L5520; 3x4GB Ram. Logs/Printouts follow this evening.
Comment 7 Daniel Mayer 2012-03-27 21:10:49 UTC
Created attachment 72733 [details]
kernel config

above bug confirmed with 3.2.13
Comment 8 Daniel Mayer 2012-03-27 21:11:50 UTC
Created attachment 72734 [details]
dmesg intel z68, asus rampage III gene, vt-d enable
Comment 9 Daniel Mayer 2012-03-27 21:12:27 UTC
Created attachment 72735 [details]
lspci, asus rampage III gene, z68, vt-d enable, 3.2.13
Comment 10 Daniel Mayer 2012-03-27 21:13:08 UTC
(In reply to comment #6)
> I confirm this bug with kernel 3.2.6: same error with VT-d enabled in bios.
> 
> With mainboard "Asus Rampage III Gene", Z68, onboard Marvell; CPU Xeon L5520;
> 3x4GB Ram. Logs/Printouts follow this evening.

Also confirmed for current kernel 3.2.13.
Comment 11 Andrew Cooks 2012-05-13 02:15:59 UTC
From a pdf file by Intel with title "Intel® Virtualization Technology for Directed I/O
Architecture Specification":
--snip--
3.6.1.4 PCI Express Devices Using Phantom Functions
To increase the maximum possible number of outstanding requests requiring completion, PCI Express allows a device to use function numbers not assigned to implemented functions to logically extend the Tag identifier. Unclaimed function numbers are referred to as Phantom Function Numbers (PhFN). A device reports its support for phantom functions through the Device Capability configuration register, and requires software to explicitly enable use of phantom functions through the Device Control configuration register.

Since the function number is part of the requester-id used to locate the context-entry for processing a DMA request, when assigning PCI Express devices with phantom functions enabled, software must program multiple context entries, each corresponding to the PhFN enabled for use by the device function. Each of these context-entries must be programmed identically to ensure the DMA requests with any of these requester-ids are processed identically.
--snip--

grep -ri phant says pci_regs.h knows about the capability, but it doesn't appear anywhere else in the kernel as far as I can see. Look for PCI_EXP_DEVCAP_PHANTOM and PCI_EXP_DEVCTL_PHANTOM.

Unfortunately, lspci indicates that the Marvell chip is not using phantom functions (lspci upload to follow), so at this point I can't tell if I'm on the right trail. 

Caveat lector: I don't have any previous experience with low-level PCI stuff.
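
To make the quoted spec text concrete, here is a rough sketch of how the phantom function fields could be decoded and which requester IDs a function may then use. The two masks are the values of PCI_EXP_DEVCAP_PHANTOM and PCI_EXP_DEVCTL_PHANTOM from pci_regs.h; the helper names and the register values in the example are made up for illustration, and the bit layout follows my reading of the PCI Express Device Capabilities definition (phantom bits are taken from the most significant function number bits).

```python
# Bit masks as defined in the kernel's pci_regs.h.
PCI_EXP_DEVCAP_PHANTOM = 0x18    # Device Capabilities: Phantom Functions Supported
PCI_EXP_DEVCTL_PHANTOM = 0x0200  # Device Control: Phantom Functions Enable

def phantom_bits(devcap):
    """Number of function-number bits usable as phantom bits (0 to 3)."""
    return (devcap & PCI_EXP_DEVCAP_PHANTOM) >> 3

def phantom_enabled(devctl):
    return bool(devctl & PCI_EXP_DEVCTL_PHANTOM)

def requester_ids(bus, dev, fn, devcap, devctl):
    """List every (bus, dev, fn) the function may present as requester ID."""
    n = phantom_bits(devcap) if phantom_enabled(devctl) else 0
    low_bits = 3 - n  # bits left over for real function numbers
    if fn >= (1 << low_bits):
        raise ValueError("function number overlaps the phantom bits")
    return [(bus, dev, fn | (k << low_bits)) for k in range(1 << n)]

# Hypothetical register values, not read from real hardware.
no_phantom = requester_ids(0x0b, 0, 0, devcap=0x00, devctl=0x0000)
one_bit = requester_ids(0x0b, 0, 0, devcap=0x08, devctl=0x0200)
print(no_phantom)
print(one_bit)
```

Each requester ID in the returned list would need an identically programmed context entry, which is what the spec paragraph asks for.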
Comment 12 Andrew Cooks 2012-05-13 02:24:21 UTC
Created attachment 73265 [details]
lspci output including device capabilities
Comment 13 Robert Cicconetti 2012-12-15 20:30:24 UTC
I'm seeing similar errors with AMD-Vi (AMD's IOMMU implementation) and a couple of Marvell 88SE9128-based cards, and can confirm that it is still present in 3.7.0 builds.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1089768
Comment 14 Stijn Tintel 2013-01-20 07:43:03 UTC
This problem happens here as well. Asus P9X79 WS, BIOS 3306, X79, i7-3930K. Running kernel 3.7.3. In addition to being unable to use the Marvell SATA controller ports, this causes a ~40s hang during boot.

I tried contacting Asus about this, as I think this could be fixed by a BIOS update, but they replied to me in horrible English they do not support Linux. I'll think twice before buying Asus again in the future, but it would be nice if a workaround could be implemented in the kernel.
Comment 15 Stijn Tintel 2013-01-20 07:43:58 UTC
Created attachment 91521 [details]
dmesg on Asus P9X79 WS, kernel 3.7.3
Comment 16 Stijn Tintel 2013-01-20 07:44:34 UTC
Created attachment 91531 [details]
lspci -knvv on Asus P9X79 WS, kernel 3.7.3
Comment 17 Stijn Tintel 2013-02-18 22:01:33 UTC
FWIW, I still have this issue with 3.7.8 and 3.8-rc7. BIOS update 3401 for the P9X79 WS didn't help. Additionally the hang during boot becomes worse (up to ~65 seconds), when a hard drive is connected. Since the drive is unusable anyway, I hacked the AHCI driver to ignore the Marvell controller. While no solution to this problem, at least my boot time is back to normal (<30s).
Comment 18 tradofox 2013-05-14 14:08:46 UTC
Same problem with Marvell 88SE9172 SATA Controller.
I have a Gigabyte GA-Z77X-UD5H with two Marvell 88SE9172 SATA controllers and an Intel E3-1245v2 CPU. VT-d is enabled. When running normal Debian 7 or Ubuntu 12.04, I can see the HDDs and SSDs connected to the Marvell ports. After installing XenServer 6.1 and Xen Cloud Platform 1.6, the HDDs and SSDs are not detected, although lspci shows that the Marvell 88SE9172 controllers are detected.
Comment 19 Li, ZhenHua 2013-09-27 08:52:58 UTC
The root cause of this bug seems to be that the device illegally accessed memory that should be reserved for the IOMMU module, and this changed the IOMMU registers.
Comment 20 Bjorn Helgaas 2013-09-27 20:48:15 UTC
ZhenHua, can you elaborate on this?  Do you mean a device accessed the MMIO space used to program the IOMMU itself?  If so, how did you conclude that?  I doubt the IOMMU space is at address 0xfff00000.

Based on the following data:

  Paweł:
    DMAR:[DMA Read] Request device [0b:00.1] fault addr fff00000
    DMAR:[fault reason 02] Present bit in context entry is clear
    0b:00.0 [0106]: Marvell [1b4b:9123]
  Korneliusz:
    DMAR:[DMA Read] Request device [03:00.1] fault addr fffc0000
    DMAR:[fault reason 02] Present bit in context entry is clear
    03:00.0 [0106]: Marvell 88SE9123 SATA [1b4b:9123]
  Daniel:
    IOMMU identity map errors (assuming unrelated for now)
    DMAR:[DMA Read] Request device [01:00.1] fault addr fff00000
    DMAR:[fault reason 02] Present bit in context entry is clear
    01:00.0 [0106]: Marvell 88SE9123 SATA [1b4b:9123]
  Stijn:
    dmar: DMAR:[DMA Read] Request device [07:00.1] fault addr fff00000
    DMAR:[fault reason 02] Present bit in context entry is clear
    07:00.0 0106: 1b4b:9130 (rev 11) (prog-if 01 [AHCI 1.0])

in each case the IOMMU saw a DMA read to an address that wasn't mapped for the requesting device.  In each case, the requester is function .1, the kernel doesn't know about a .1 function, and there is a Marvell 912x SATA controller at the corresponding .0 function.

Andrew's Phantom Function theory seems like a good direction to explore.  Maybe these devices incorrectly report Phantom Function support in the Device Capability & Control, and we just need some sort of quirk to work around that.

It would be interesting to know whether the .0 Marvell function has valid IOMMU mappings for the fault addresses (0xfff00000 or 0xfffc0000), or whether there is really anything at those addresses.  They seem like dubious targets for DMA.
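
The pattern in the data above can be checked with a few lines. This is just a sketch over the values quoted in this comment; the CASES table and the helper names are illustrative, and "enumerated" lists only the Marvell function quoted for each reporter.

```python
# Requester from the DMAR fault and the Marvell function quoted per reporter.
CASES = {
    "Paweł": {"requester": "0b:00.1", "enumerated": ["0b:00.0"]},
    "Korneliusz": {"requester": "03:00.1", "enumerated": ["03:00.0"]},
    "Daniel": {"requester": "01:00.1", "enumerated": ["01:00.0"]},
    "Stijn": {"requester": "07:00.1", "enumerated": ["07:00.0"]},
}

def slot_of(bdf):
    """Strip the function number: '0b:00.1' -> '0b:00'."""
    return bdf.rsplit(".", 1)[0]

def is_unknown_sibling(requester, enumerated):
    """True if the requester is not enumerated but shares a slot with a function that is."""
    if requester in enumerated:
        return False
    return any(slot_of(requester) == slot_of(e) for e in enumerated)

results = {
    name: is_unknown_sibling(c["requester"], c["enumerated"])
    for name, c in CASES.items()
}
print(results)
```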
Comment 21 Li, ZhenHua 2013-09-29 07:30:39 UTC
Hi guys,

    1. Since the only lspci output so far was captured with intel_iommu=on, could you paste the output of lspci -vvv, lspci -t, and lspci -n with intel_iommu not set to on?
    
Thanks
ZhenHua
Comment 22 Andrew Cooks 2013-09-30 05:02:30 UTC
Created attachment 109981 [details]
Patch with quirk for incorrect PCI requester IDs

Here's a patch that provides a quirk for what I believe to be the root cause: devices that use incorrect PCI requester IDs, including Marvell 91xx controllers.

Various revisions have been sent to LKML and IOMMU-list in the past and a number of people have reported that it solved their problem and I've been running this on two boxes for months. I'm not sure why it hasn't been accepted.

Note that there are several devices that suffer from the same affliction, i.e., using incorrect PCI requester IDs in their transactions. The Marvell devices use both xx:yy.0 and xx:yy.1, possibly related to the SATA port number. Other devices, like Ricoh's R5C832 PCIe IEEE 1394 Controller commonly found in T410 and T420 Thinkpads, use a single incorrect requester ID.

Please try this patch and let me know if it works for you.
Comment 23 Li, ZhenHua 2013-09-30 07:11:03 UTC
Each context_entry has a present bit. If a context entry is used for a device but its present bit is not set to 1, an error with fault reason 2 will occur.

I tested this on my PC: commenting out the line "context_set_present(context);" causes the same error. So I guess the devices that show the error may be using a context entry whose present bit is 0.
Comment 24 Li, ZhenHua 2013-09-30 07:21:20 UTC
See this line in drivers/iommu/intel-iommu.c, in the function domain_context_mapping_one():
     context_set_present(context);

It sets the present bit of the context entry. If you comment out this line, you will get the error.
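
As a toy model of the lookup described above (not the kernel's data structures), a context entry can be thought of as keyed by bus number and devfn, and a DMA request whose requester ID has no present entry faults. The class and method names below are invented for the sketch, apart from context_set_present, which borrows the kernel function's name only as a label.

```python
def devfn(dev, fn):
    """Pack a 5-bit device number and a 3-bit function number into one byte."""
    return ((dev & 0x1f) << 3) | (fn & 0x7)

class ToyContextTables:
    """Tracks which (bus, devfn) context entries have the present bit set."""

    def __init__(self):
        self.present = set()

    def context_set_present(self, bus, dev, fn):
        self.present.add((bus, devfn(dev, fn)))

    def dma(self, bus, dev, fn):
        if (bus, devfn(dev, fn)) in self.present:
            return "translated"
        return "fault reason 02: Present bit in context entry is clear"

tables = ToyContextTables()
tables.context_set_present(0x0b, 0x00, 0)  # only function 0 gets a context entry
print(tables.dma(0x0b, 0x00, 0))
print(tables.dma(0x0b, 0x00, 1))
```

In this model a request from 0b:00.1 faults even though 0b:00.0 is mapped, which matches the symptom in the original report.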
Comment 25 Paweł Żak 2013-09-30 12:02:05 UTC
(In reply to Andrew Cooks from comment #22)

> Please try this patch and let me know if it works for you.

It does remove the lag and there are no longer error entries in system log. I also tried to connect a drive to this controller and it worked too. It appears then that this patch did the job.
Comment 26 Martin Öhrling 2013-12-26 23:12:45 UTC
(In reply to Andrew Cooks from comment #22)
> Created attachment 109981 [details]
> Patch with quirk for incorrect PCI requester IDs
> 

I'm trying to enable the iommu on a Gigabyte z87x-ud5h board, and a VIA firewire controller and a Marvell 88SE9230 SATA controller are misbehaving. I'll add a couple of entries to your lists when I'm done testing.

I've found one thing that looks a bit strange in your patch. The pci_requester() function uses the devfn member to break out of the for loop. Take a close look at the last entry, for "Mellanox 26428", in pci_dev_dma_source_map. That entry's devfn value will break the loop, so that entry, and any entry later appended to the list, will not be evaluated.
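
To illustrate the kind of problem I mean, here is a sketch with a made-up table. I am assuming the loop stops when it sees a devfn of 0; the table contents, the tuple layout and both lookup functions are hypothetical and are not copied from the patch.

```python
# Hypothetical table rows: (vendor, device, devfn). The all-zero row is the terminator.
TABLE = [
    (0x1b4b, 0x9123, 0x01),
    (0x15b3, 26428, 0x00),   # a legitimate entry whose devfn happens to be 0
    (0x1b4b, 0x9230, 0x01),  # stands in for an entry appended later
    (0, 0, 0),
]

def lookup_buggy(vendor, device):
    """Stops at the first devfn of 0, so it never reaches the rows after it."""
    for v, d, fn in TABLE:
        if fn == 0:
            break
        if (v, d) == (vendor, device):
            return fn
    return None

def lookup_fixed(vendor, device):
    """Stops only at the terminator row, identified by a zero vendor ID."""
    for v, d, fn in TABLE:
        if v == 0:
            break
        if (v, d) == (vendor, device):
            return fn
    return None

print(lookup_buggy(0x1b4b, 0x9230))
print(lookup_fixed(0x1b4b, 0x9230))
```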
Comment 27 Andrew Cooks 2014-01-09 23:08:50 UTC
(In reply to Martin Öhrling from comment #26)
> (In reply to Andrew Cooks from comment #22)
> > Created attachment 109981 [details]
> > Patch with quirk for incorrect PCI requester IDs
> > 
> 
> I'm trying to enable the iommu on a gigabyte z87x-ud5h board and a VIA
> firewire controller and a Marvell 88SE9230 SATA controller is misbehaving.
> I'll add a couple of entries to your lists when I'm done testing.

Did you have any success? Could you provide more information, please?

> I've found one thing that looks a bit strange in your patch. The
> pci_requester() function is using the devfn member to break the for loop.
> Take a close look at the last entry for "Mellanox 26428" in the
> pci_dev_dma_source_map. The devfn members value will break the loop. That
> entry, and any entry later appended to the list, will not be evaluated.

Yes, it's definitely broken, but trivial to fix. Thanks for reporting it.
Comment 28 Martin Öhrling 2014-01-11 14:57:25 UTC
> 
> Did you have any success? Could you provide more information, please?
> 

I've had some success but I also ran into bug 44881 (pcie to pci bridge shown
as a pci to pci bridge causing pci_find_upstream_pcie_bridge() to fail). The VIA controller turned out to be a pci device connected to the bridge. DMA requests were sent from the bridge device id and that was not mapped into the iommu table.

The 9230 controller from Marvell sent DMA requests from functions 0 and 1. I got rid of all "Present bit in context entry is clear" errors when I inserted this entry into pci_dev_dma_multi_source_map[]:

       { PCI_VENDOR_ID_MARVELL_EXT, 0x9230, (1<<0)|(1<<1)},

I'm not considering this to be fully verified since I still have problems booting the system. Next step is to apply suggested patches for bug 44881. This turned out to be more work than I expected...
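
For reference, the third field of the entry above reads as a bitmask of function numbers, if I understand the patch correctly. A small sketch of that decoding follows; the helper names are mine, and the bus/device numbers in the example are taken from my own system (06:00) purely for illustration.

```python
def functions_from_mask(mask):
    """Return the function numbers (0 to 7) whose bits are set in mask."""
    return [fn for fn in range(8) if mask & (1 << fn)]

def format_bdf(bus, dev, fn):
    return f"{bus:02x}:{dev:02x}.{fn}"

mask = (1 << 0) | (1 << 1)  # same expression as in the table entry
fns = functions_from_mask(mask)
ids = [format_bdf(0x06, 0x00, fn) for fn in fns]
print(fns)
print(ids)
```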
Comment 29 Martin Öhrling 2014-01-11 22:58:59 UTC
I'm still getting one error reported by the iommu. My best guess is that this is a bios bug:

[    0.675780] dmar: DRHD: handling fault status reg 2
[    0.676191] dmar: DMAR:[DMA Read] Request device [06:00.0] fault addr ac0a7000 
[    0.676191] DMAR:[fault reason 06] PTE Read access is not set

It's unlikely that the ahci driver fails to map memory for dma. This is the memory map entry reported by bios (from dmesg):

[    0.000000] e820: BIOS-provided physical RAM map:
...
[    0.000000] BIOS-e820: [mem 0x00000000a77e1000-0x00000000b9ecffff] usable
...

RMRR ranges:

[    0.047843] dmar: Host address width 39
[    0.047845] dmar: DRHD base: 0x000000fed90000 flags: 0x0
[    0.047851] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c0000020660462 ecap f0101a
[    0.047852] dmar: DRHD base: 0x000000fed91000 flags: 0x1
[    0.047856] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap d2008020660462 ecap f010da
[    0.047857] dmar: RMRR base: 0x000000ba063000 end: 0x000000ba06ffff
[    0.047858] dmar: RMRR base: 0x000000bb800000 end: 0x000000bf9fffff

...

[    0.470870] IOMMU 0 0xfed90000: using Queued invalidation
[    0.470871] IOMMU 1 0xfed91000: using Queued invalidation
[    0.470873] IOMMU: Setting RMRR:
[    0.470881] IOMMU: Setting identity map for device 0000:00:02.0 [0xbb800000 - 0xbf9fffff]
[    0.471198] IOMMU: Setting identity map for device 0000:00:1d.0 [0xba063000 - 0xba06ffff]
[    0.471219] IOMMU: Setting identity map for device 0000:00:1a.0 [0xba063000 - 0xba06ffff]
[    0.471236] IOMMU: Setting identity map for device 0000:00:14.0 [0xba063000 - 0xba06ffff]
[    0.471249] IOMMU: Prepare 0-16MiB unity mapping for LPC
[    0.471255] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]


I can't find anything in the dmesg output that would indicate that the kernel has been made aware of a reserved memory range that includes the offending address.

As far as I can tell, the patch resolved my first issue, I no longer get any present bit errors. My current issue can't be a kernel bug.
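
The range check I did by hand can be written down as follows. The addresses are the ones from the dmesg excerpts above; the variable and function names are only for this sketch.

```python
FAULT_ADDR = 0xac0a7000

# Ranges quoted from dmesg above, as inclusive (start, end) pairs.
E820_USABLE = [(0xa77e1000, 0xb9ecffff)]
RMRR = [
    (0xba063000, 0xba06ffff),
    (0xbb800000, 0xbf9fffff),
]
LPC_UNITY_MAP = [(0x0, 0xffffff)]

def containing_ranges(addr, ranges):
    """Return every inclusive range that contains addr."""
    return [(lo, hi) for lo, hi in ranges if lo <= addr <= hi]

in_usable = containing_ranges(FAULT_ADDR, E820_USABLE)
in_rmrr = containing_ranges(FAULT_ADDR, RMRR)
in_lpc = containing_ranges(FAULT_ADDR, LPC_UNITY_MAP)
print("usable RAM:", [(hex(lo), hex(hi)) for lo, hi in in_usable])
print("RMRR:", in_rmrr)
print("LPC unity map:", in_lpc)
```

The fault address falls inside the usable RAM range and outside every reserved range listed.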
Comment 30 Andrew Cooks 2014-01-31 14:08:30 UTC
Created attachment 124001 [details]
Patch with quirk for incorrect PCIe requester IDs.

Changes:
* Include various bit shifting and masking fixes by George Spelvin
* Fix pci_requester() using wrong loop condition, reported by Martin Öhrling   
* Expand list of quirked device ids
* Attempt to include support for AMD (needs testing)

If you'd like your name included as Reported-by or Tested-by, let me know.
Comment 31 Thomas Kuther 2014-02-03 08:07:56 UTC
I can confirm that Andrew's patch from #30 fixes the issue on my AMD based Gigabyte 990FXA-UD5 board. Both Marvell 88SE9172 controllers (internal and eSATA) are working now. Thanks!
Comment 32 MvW 2014-02-05 18:41:31 UTC
Can anyone please direct me on how to apply this patch to my kernel?
(3.11.0-15-generic #25-Ubuntu SMP Thu Jan 30 17:22:01 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux)
Possibly using a deb package?

Excuse me for being a complete noob here. ;)
Comment 33 Andreas Schrägle 2014-02-07 15:39:40 UTC
(In reply to MvW from comment #32)
> Can anyone please direct me on how to apply this patch to my kernel?
> (3.11.0-15-generic #25-Ubuntu SMP Thu Jan 30 17:22:01 UTC 2014 x86_64 x86_64
> x86_64 GNU/Linux)
> Possibly using a deb package?
> 
> Excuse me for being a complete noob here. ;)

The kernelnewbies wiki has articles on this stuff, e.g. http://kernelnewbies.org/KernelBuild

(In reply to Andrew Cooks from comment #30)
> Created attachment 124001 [details]
> Patch with quirk for incorrect PCIe requester IDs.
> 
> Changes:
> * Include various bit shifting and masking fixes by George Spelvin
> * Fix pci_requester() using wrong loop condition, reported by Martin Öhrling
> 
> * Expand list of quirked device ids
> * Attempt to include support for AMD (needs testing)
> 
> If you'd like your name included as Reported-by or Tested-by, let me know.

I'd like to report that this kind of works for me with kernel 3.13.2 on my ASRock 990FX Extreme4 board, which (sometimes) has a 88SE91A0 controller. I added its PCI-ID (1b4b:91a0) like this: "{ PCI_VENDOR_ID_MARVELL_EXT, 0x91a0, (1<<0)|(1<<1)}," to the patch.
This controller really confuses me, when it works it shows up as "02:00.0 IDE interface [0101]: Marvell Technology Group Ltd. 88SE91A0 SATA 6Gb/s Controller [1b4b:91a0] (rev 12)", but sometimes it's "02:00.0 SATA controller [0106]: Marvell Technology Group Ltd. Device [1b4b:9122] (rev 12)
02:00.1 IDE interface [0101]: Marvell Technology Group Ltd. 88SE912x IDE Controller [1b4b:91a4] (rev 12)" and doesn't work. I think (hope) changing some options in my UEFI like disabling fast boot, the boot failure guard and activating the "make devices on the second sata controller bootable" ROM made it so that it always boots as a 91a0, but I'm not really convinced yet.

When it works, eSata and passthrough with vfio-pci both work. I don't know if this is the right place for this, but Linux doesn't recognize this controller at all, so I'm doing "echo 1b4b 91a0 > /sys/bus/pci/drivers/ahci/new_id" for testing at the moment. Do I have to talk to the AHCI maintainers to get this fixed or how would I go about this?
Comment 34 Li, ZhenHua 2014-02-10 02:07:23 UTC
(In reply to MvW from comment #32)
> Can anyone please direct me on how to apply this patch to my kernel?
> (3.11.0-15-generic #25-Ubuntu SMP Thu Jan 30 17:22:01 UTC 2014 x86_64 x86_64
> x86_64 GNU/Linux)
> Possibly using a deb package?
> 
> Excuse me for being a complete noob here. ;)

If you want to apply a kernel patch, you need to checkout the kernel source with git and send mails to kernel.org community. You can google it to get more information.
Comment 35 MvW 2014-02-10 14:20:40 UTC
(In reply to Andreas Schrägle from comment #33)
> The kernelnewbies wiki has articels on this stuff, e.g.
> http://kernelnewbies.org/KernelBuild

(In reply to Li, ZhenHua from comment #34)
> If you want to apply a kernel patch, you need to checkout the kernel source
> with git and send mails to kernel.org community. You can google it to get
> more information.

Thanks for the pointers, I'll be looking in to it shortly.
Will this fix be included in the 3.14 release, will it be backported to 3.11?
Comment 36 Andrew Cooks 2014-02-10 15:35:17 UTC
(In reply to MvW from comment #35)
> Will this fix be included in the 3.14 release,

No.

> will it be backported to 3.11?

No.

The patch I created needs work.

* The AMD-specific code doesn't really cope with one-to-many mappings (despite what the current patch suggests), because other parts of the AMD IOMMU driver need work to support this.

* It needs to be restructured to provide a common interface for finding all requester ids and calling a callback for each.

It's unlikely that I'll be able to do the mentioned work soon and there are probably more changes required that I don't know about. I'll keep posting improvements when I can.

Of course there might be other people working on this and a better patch could appear and be included at any time.
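
To give an idea of the restructuring mentioned above, this is roughly the shape of interface I have in mind, sketched in Python rather than kernel C. Everything here is hypothetical: for_each_requester_id(), the QUIRKS table layout and the dictionary fields are placeholders, and only the Marvell vendor ID 0x1b4b and the 9230 function mask come from this report.

```python
PCI_VENDOR_ID_MARVELL_EXT = 0x1b4b

# Hypothetical quirk table: (vendor, device) -> bitmask of functions used as requesters.
QUIRKS = {(PCI_VENDOR_ID_MARVELL_EXT, 0x9230): (1 << 0) | (1 << 1)}

def for_each_requester_id(dev, callback, quirks=QUIRKS):
    """Call callback(bus, devfn) for every requester ID the device may use.

    Returns the number of requester IDs visited.
    """
    own = 1 << dev["fn"]
    mask = quirks.get((dev["vendor"], dev["device"]), 0) | own
    count = 0
    for fn in range(8):
        if mask & (1 << fn):
            callback(dev["bus"], (dev["slot"] << 3) | fn)
            count += 1
    return count

seen = []
marvell = {"vendor": 0x1b4b, "device": 0x9230, "bus": 0x06, "slot": 0, "fn": 0}
visited = for_each_requester_id(marvell, lambda bus, devfn: seen.append((bus, devfn)))
print(visited, seen)
```

An IOMMU driver would pass a callback that programs one context entry per requester ID, so the Intel and AMD code could share the quirk lookup.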
Comment 37 MvW 2014-02-13 11:55:58 UTC
(In reply to Andrew Cooks from comment #36)
> (In reply to MvW from comment #35)
> > Will this fix be included in the 3.14 release,
> 
> No.
> 
> > will it be backported to 3.11?
> 
> No.
> 
> The patch I created needs work.
> 
> * The AMD-specific code doesn't really cope with one-to-many mappings
> (despite what the current patch suggests), because other parts of the AMD
> IOMMU driver needs work to support this.
> 
> * It needs to be restructured to provide a common interface for finding all
> requester ids and calling a callback for each.
> 
> It's unlikely that I'll be able to do the mentioned work soon and there are
> probably more changes required that I don't know about. I'll keep posting
> improvements when I can.
> 
> Of course there might be other people working on this and a better patch
> could appear and be included at any time.

Does this also apply to the Intel IOMMU implementation which this report is about? I'm having this issue with an Asus P9X79 Deluxe using the same SATA chip (88SE9128) as the bug reporter.

I'm unfortunately unable to fix this myself, so perhaps instead of patching my kernel until this issue has been fixed upstream, I should consider buying another SATA controller, since this won't get fixed for a while to come, right?
Comment 38 Andrew Cooks 2014-02-13 12:40:33 UTC
(In reply to MvW from comment #37)
> Does this also apply to the Intel IOMMU implementation which this report is
> about?

Yes, in my opinion.
 
> I'm unfortunately unable to fix this myself, so perhaps instead of patching
> my kernel until this issue has been fixed upstream, I should consider buying
> another SATA controller since this won't get fixed for a while to come right?

I encourage you to apply the patch. It may need improvement to be acceptable to the mainline developers, but it does work and will be maintained until it is either acceptable for the mainline kernel or until someone else provides an acceptable patch.

If there are other reasons why you can't use the patch, let's try to address those.
Comment 39 MvW 2014-02-13 13:20:09 UTC
(In reply to Andrew Cooks from comment #38)
> Yes, in my opinion.

What I meant to ask was whether these AMD-specific issues would warrant a separate patch or an extension to this one, making the current implementation without the AMD parts eligible for inclusion in the mainline kernel to help all of us Intel-based users.

> I encourage you to apply the patch. It may need improvement to be acceptable
> to the mainline developers, but it does work and will be maintained until it
> is either acceptable for the mainline kernel or until someone else provides
> an acceptable patch.
> 
> If there are other reasons why you can't use the patch, let's try to address
> those.

I'm inclined to do so, but I also have to consider that this particular machine has to run 24/7 and be upgraded with the necessary security updates as they are released.  Re-patching the kernel on every kernel update for the foreseeable future would take too much effort compared to buying another simple SATA controller to avoid this issue altogether.

I do, however, want to help, even if only by testing the solution at hand, but there are some concerns that need to be addressed.
Is there any (extra) risk of data loss with this patch?
Is there an easy way to apply this patch across kernel updates automatically?
Comment 40 Shawn 2014-02-15 16:52:52 UTC
I can unfortunately confirm that Marvell 88SE92xx series chips also suffer from this issue.  My motherboard had a 9120 that didn't work because of this and I needed the ports, so I purchased a 9230-based PCIe add-on card and am experiencing the same symptoms.

(I'm actually not a member of your community: I'm an ESXi user who was pointed to this thread from here: http://www.v-front.de/2013/11/how-to-make-your-unsupported-sata-ahci.html and wanted to share my experience so others wouldn't need to repeat my mistake of buying a 92xx-based device as a workaround.)
Comment 41 michael 2014-03-14 21:58:53 UTC
I can confirm that the patch makes the 88SE9172 usable, attached HD seems to work fine. However there still seems to be a problem with accessing the option ROM:

# lspci | grep Marvell
08:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller (rev 11)

# cd /sys/devices/pci0000\:00/0000\:00\:1c.7/0000\:08\:00.0/
# echo 1 > rom 
# dd if=rom of=/tmp/rom_dump
dd: error reading ‘rom’: Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0,000123135 s, 0,0 kB/s
# dmesg
[...]
[  998.737809] ahci 0000:08:00.0: Invalid ROM contents

I stumbled on this issue when trying to passthrough the Marvell controller to a virtual machine using qemu/kvm. When trying to use pci-assign (or the newer vfio-pci) qemu complains about missing optionrom and will not find any connected drive.
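
A quick way to sanity-check a ROM dump, once one can be read, is to look at its first two bytes. This is a sketch under an assumption: as far as I know, a PCI expansion ROM image starts with the signature bytes 0x55 0xAA, and the "Invalid ROM contents" message is printed when the kernel does not find that signature. The function name and the sample byte strings are made up for illustration.

```python
def has_rom_signature(image):
    """True if the byte string starts with the 0x55 0xAA expansion ROM signature."""
    return len(image) >= 2 and image[0] == 0x55 and image[1] == 0xAA

# Made-up sample data standing in for the contents of a dump such as /tmp/rom_dump.
good = bytes([0x55, 0xAA, 0x40, 0x00])
bad = bytes([0xFF, 0xFF, 0xFF, 0xFF])
empty = b""

print(has_rom_signature(good))
print(has_rom_signature(bad))
print(has_rom_signature(empty))
```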
Comment 42 Andreas Schrägle 2014-03-24 21:38:17 UTC
(In reply to michael from comment #41)
> I can confirm that the patch makes the 88SE9172 usable, attached HD seems to
> work fine. However there still seems to be a problem with accessing the
> option ROM:
> 
> # lspci | grep Marvell
> 08:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s
> Controller (rev 11)
> 
> # cd /sys/devices/pci0000\:00/0000\:00\:1c.7/0000\:08\:00.0/
> # echo 1 > rom 
> # dd if=rom of=/tmp/rom_dump
> dd: error reading ‘rom’: Input/output error
> 0+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0,000123135 s, 0,0 kB/s
> # dmesg
> [...]
> [  998.737809] ahci 0000:08:00.0: Invalid ROM contents
> 
> I stumbled on this issue when trying to passthrough the Marvell controller
> to a virtual machine using qemu/kvm. When trying to use pci-assign (or the
> newer vfio-pci) qemu complains about missing optionrom and will not find any
> connected drive.

I can't access the option rom on my controller either:
02:00.0 IDE interface: Marvell Technology Group Ltd. 88SE91A0 SATA 6Gb/s Controller (rev 12)

However, the device works as well when passed through with vfio-pci as it does on the host (the linux ahci driver works but doesn't recognize it and windows needs a special driver).
Comment 43 Tom Wijsman 2014-04-02 15:59:58 UTC
Similar bug downstream: https://bugs.gentoo.org/show_bug.cgi?id=497630
Comment 44 Joshua Scoggins 2014-05-02 05:29:13 UTC
Created attachment 134661 [details]
Logs and assorted information with respect to IOMMU issues

I have had this issue with my Asus Z9 PE-D8 WS since I bought it back in 2012.
I have a Marvell 88SE9230 PCIe SATA 6Gb/s Controller and am running Gentoo 
Linux AMD64 with kernel 3.10.25 (later versions of the kernel do not agree with
my GTX670 and the screen is blank on boot) and the patch provided in this PR.
Unfortunately, it does not work and I am still getting the same error
as before:

dmar: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [0b:00.0] fault addr 0
DMAR: [fault reason 6] PTE read access is not set
....
ata[6-9] COMRESET Failed (errno=-16)

This error leads to my marvell controller not being initialized and the disks
on it fall out of the software raid 6 I'm running (four of the seven disks were
always active on the intel controller so no data loss occurred at any point
during this).

So I decided to go all out and try several different combinations of options in
the BIOS and kernel options with this patch applied. The options that I tried 
different combinations of were:

1) Intel VT-d  [BIOS]
2) Address Translation Services [BIOS - Sub option if Intel VT-d is enabled]
3) Coherency Support [BIOS - Sub option if Intel VT-d is enabled]
4) Native Command Queueing [Kernel cmd line, disabled with libata.force=noncq]

NOTE that if VT-d is disabled then I have no issues so I am only showing the
default configuration at the top of the following table. Interpret the binary
digits as having the above features turned on or off [read it from left to
right]:

0001 : Works [Software IOMMU is used]
1111 : Fails [ata6 and ata7 had COMRESET failures]
1110 : Fails [ata7 has COMRESET failures (ata6 is absent)]
1011 : Fails [ata7 and ata8 have COMRESET failures]
1010 : Fails [ata7 and ata8 have COMRESET failures]
1101 : Fails [ata8 and ata7 (order changed) have COMRESET failures, faster]
1100 : Fails [ata9 and ata7 (different device) have COMRESET failures, faster]
1001 : Fails [ata7 and ata8 have COMRESET failures]
1000 : Fails [ata7 and ata8 have COMRESET failures]
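For anyone reading the table above, the four digits can be decoded mechanically. The following is only a sketch of that decoding; the struct and the helper name decode_config are made up for illustration and are not part of any tool mentioned in this report.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Fields follow the numbered option list above, read left to right. */
struct boot_config {
    bool vtd;        /* 1) Intel VT-d [BIOS] */
    bool ats;        /* 2) Address Translation Services [BIOS] */
    bool coherency;  /* 3) Coherency Support [BIOS] */
    bool ncq;        /* 4) Native Command Queueing [kernel cmd line] */
};

/* Hypothetical helper: accepts exactly four '0'/'1' characters. */
static bool decode_config(const char *s, struct boot_config *out)
{
    if (s == NULL || out == NULL || strlen(s) != 4)
        return false;
    for (size_t i = 0; i < 4; i++) {
        if (s[i] != '0' && s[i] != '1')
            return false;
    }
    out->vtd       = (s[0] == '1');
    out->ats       = (s[1] == '1');
    out->coherency = (s[2] == '1');
    out->ncq       = (s[3] == '1');
    return true;
}
```

With this reading, "1101" means VT-d, ATS and NCQ enabled with Coherency Support disabled.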

I have also attached a file containing the contents of /var/log/messages for
different runs. I have annotated each run in the file so that it can be referred
to somewhat easily. I have also attached the results of lspci, I will warn you
that it is quite large as I have 150 PCI devices on my bus!

While this may be irrelevant, I noticed that the problematic device (ata14)
actually does exist and is registered in Windows 7 Professional 64-bit as a
"Marvell Console" device which is on port 14. I see references to it not being
identifiable in the annotated runs document I have attached.
Comment 45 Alex Williamson 2014-05-02 05:43:59 UTC
I've proposed a patch series that should resolve this:

https://lkml.org/lkml/2014/5/1/290

The patches applied to v3.15-rc3 can also be found here:

git://github.com/awilliam/linux-vfio.git dma-alias

Please test and provide feedback here or to the list.  If your controller is not one of the ones listed in patch 04/13, please add a new entry for it and report it here.  Thanks
Comment 46 Andreas Schrägle 2014-05-02 14:44:43 UTC
@Alex Williamson: Your patches seem to work fine for me, with this addition for my controller.

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index ea55b0f..f0d8b11 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3366,6 +3366,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9123,
 /* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c14 */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9130,
                         quirk_dma_func1_alias);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x91a0,
+                        quirk_dma_func1_alias);
 /* https://bugs.gentoo.org/show_bug.cgi?id=497630 */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_JMICRON,
                         PCI_DEVICE_ID_JMICRON_JMB388_ESD,
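
As a rough user-space illustration of what these fixup entries amount to, the sketch below matches a vendor/device pair against a small ID list. It does not use the kernel's DECLARE_PCI_FIXUP_HEADER machinery; the array and the function needs_func1_alias are hypothetical names, the vendor ID 0x1b4b is taken from the lspci output in this report, and the list holds only the IDs visible in the diff above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Marvell vendor ID as printed by lspci in this report ([1b4b:....]). */
#define MARVELL_EXT_VENDOR 0x1b4b

/* Illustrative list only: the device IDs visible in the diff above. */
static const uint16_t func1_alias_ids[] = { 0x9123, 0x9130, 0x91a0 };

/* Hypothetical helper: true if the device would get the func1 alias quirk. */
static bool needs_func1_alias(uint16_t vendor, uint16_t device)
{
    if (vendor != MARVELL_EXT_VENDOR)
        return false;
    for (size_t i = 0; i < sizeof(func1_alias_ids) / sizeof(func1_alias_ids[0]); i++) {
        if (func1_alias_ids[i] == device)
            return true;
    }
    return false;
}
```

A controller whose ID is not in the list is simply not quirked, which is why each new chip reported here needs its own entry.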
Comment 47 Tobias N 2014-05-02 22:23:37 UTC
@Alex Williamson: Your Kernel from git is working on Gigabyte GA-X79-UP4 Rev1.1 BIOS F7 in Debian Testing Jessie.

05:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 11)
06:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 11)
07:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 11)

I added DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9172, quirk_dma_func1_alias); to /drivers/pci/quirks.c

The hard disk is accessible with VT-d active, and it's possible to use vfio passthrough with x-vga in qemu 1.7.

Thank you for your work.
Comment 48 Tobias N 2014-05-02 22:30:09 UTC
Created attachment 134751 [details]
lspcis for Gigabyte GA-X79-UP4 Rev. 1.1
Comment 49 Joshua Scoggins 2014-05-03 05:16:35 UTC
Created attachment 134861 [details]
lspci and logs from tycho (asus z9pe-d8 ws, X79 + C600 chipset) running 3.15-rc3

@Alex Williamson: I appreciate the work you've done, unfortunately the patches
do not affect the DMAR error I'm getting with the Marvell 88SE9230 even after
I added an entry to drivers/pci/quirks.c:

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index ea55b0f..bfe9c8d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3366,6 +3366,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9123,
 /* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c14 */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9130,
 			 quirk_dma_func1_alias);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9230,
+			 quirk_dma_func1_alias);
 /* https://bugs.gentoo.org/show_bug.cgi?id=497630 */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_JMICRON,
 			 PCI_DEVICE_ID_JMICRON_JMB388_ESD,

I have attached the output lspci and verbose output from the failed run to this
comment. I hope this helps. Once again, thanks for all the work.
Comment 50 Alex Williamson 2014-05-03 05:37:40 UTC
(In reply to Joshua Scoggins from comment #49)
> Created attachment 134861 [details]
> lspci and logs from tycho (asus z9pe-d8 ws, X79 + C600 chipset) running
> 3.15-rc3
> 
> @Alex Williamson: I appreciate the work you've done, unfortunately the patches
> do not affect the DMAR error I'm getting with the Marvell 88SE9230 even after
> I added an entry to drivers/pci/quirks.c:
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index ea55b0f..bfe9c8d 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3366,6 +3366,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT,
> 0x9123,
>  /* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c14 */
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9130,
>                        quirk_dma_func1_alias);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9230,
> +                      quirk_dma_func1_alias);
>  /* https://bugs.gentoo.org/show_bug.cgi?id=497630 */
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_JMICRON,
>                        PCI_DEVICE_ID_JMICRON_JMB388_ESD,
> 
> I have attached the output lspci and verbose output from the failed run to
> this
> comment. I hope this helps. Once again, thanks for all the work.

Hi Joshua,

I don't actually see any IOMMU faults in any of your logs, either the original or updated.  Only your update in comment 44 shows a DMAR fault.  Could you try to record a log that includes the IOMMU faults you're seeing and not just the SATA port probing failure?  Thanks
Comment 51 Joshua Scoggins 2014-05-03 05:59:23 UTC
Created attachment 134871 [details]
More Verbose DMESG output for tycho (z9pe-d8 WS, x79 + C600)

Sorry about that, I thought I saw the DMAR error in those logs.....
I enabled heavy debug mode this time and confirmed I saw the DMAR errors and have attached that log.
Comment 52 Alex Williamson 2014-05-03 15:03:35 UTC
(In reply to Joshua Scoggins from comment #51)
> Created attachment 134871 [details]
> More Verbose DMESG output for tycho (z9pe-d8 WS, x79 + C600)
> 
> Sorry about that, I thought I saw the DMAR error in those logs.....
> I enabled heavy debug mode this time and confirmed I saw the DMAR errors and
> have attached that log.

Your log doesn't seem to match the issues others are having.  There's only a single DMAR fault:

[    1.887994] dmar: DRHD: handling fault status reg 2
[    1.888349] dmar: DMAR:[DMA Read] Request device [0b:00.0] fault addr 0 
DMAR:[fault reason 06] PTE Read access is not set

This is a read access to physical address 0x0 from function 0.  It seems valid for the IOMMU to block this, the driver can't possibly have mapped a buffer at 0x0.  Later the log shows problems probing these channels:

[    7.193298] ata8.00: qc timeout (cmd 0xec)
[    7.194320] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[    7.197283] ata14.00: qc timeout (cmd 0xa1)
[    7.198266] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[    7.237279] ata7: link is slow to respond, please be patient (ready=0)
[    7.238300] ata9: link is slow to respond, please be patient (ready=0)
[    7.513247] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    7.517218] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   11.884142] ata9: COMRESET failed (errno=-16)
[   11.885130] ata7: COMRESET failed (errno=-16)
[   17.242784] ata9: link is slow to respond, please be patient (ready=0)
[   17.243720] ata7: link is slow to respond, please be patient (ready=0)
[   17.510746] ata8.00: qc timeout (cmd 0xec)
[   17.511561] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   17.512361] ata8: limiting SATA link speed to 1.5 Gbps
[   17.514750] ata14.00: qc timeout (cmd 0xa1)
[   17.515498] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   17.516231] ata14: limiting SATA link speed to 1.5 Gbps
[   17.830691] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
[   17.834674] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   21.889672] ata7: COMRESET failed (errno=-16)
[   21.890480] ata9: COMRESET failed (errno=-16)
[   27.248351] ata9: link is slow to respond, please be patient (ready=0)
[   27.249142] ata7: link is slow to respond, please be patient (ready=0)
[   47.823276] ata8.00: qc timeout (cmd 0xec)
[   47.823988] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   47.827246] ata14.00: qc timeout (cmd 0xa1)
[   47.827960] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   48.143219] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
[   48.147218] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   56.921034] ata7: COMRESET failed (errno=-16)
[   56.921773] ata7: limiting SATA link speed to 1.5 Gbps
[   56.922548] ata9: COMRESET failed (errno=-16)
[   56.923265] ata9: limiting SATA link speed to 1.5 Gbps
[   61.943795] ata9: COMRESET failed (errno=-16)
[   61.944518] ata9: reset failed, giving up
[   61.945297] ata7: COMRESET failed (errno=-16)
[   61.946022] ata7: reset failed, giving up

But, there are no further DMAR faults.  Can you confirm whether adding the 0x9230 ID for your device changes anything?  If the device is making a stray access to address 0x0, it may simply be incompatible with an IOMMU, in which case it should probably be assigned to a passthrough domain using the iommu=pt option.
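
To compare fault lines like the one quoted above across logs, it can help to pull out the requester and the faulting address programmatically. This is a minimal sketch that assumes the line layout shown in this report ("Request device [bb:dd.f] fault addr X"); the struct and the function parse_dmar_fault are hypothetical names, not anything provided by the kernel.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct dmar_fault {
    unsigned int bus, dev, fn;   /* requester as printed: bus:dev.fn (hex) */
    unsigned long long addr;     /* faulting address (hex) */
};

/* Hypothetical helper: returns true if the line contains a parsable fault. */
static bool parse_dmar_fault(const char *line, struct dmar_fault *out)
{
    const char *p;

    if (line == NULL || out == NULL)
        return false;
    p = strstr(line, "Request device [");
    if (p == NULL)
        return false;
    /* All four fields are hexadecimal in the dmesg output quoted here. */
    return sscanf(p, "Request device [%x:%x.%x] fault addr %llx",
                  &out->bus, &out->dev, &out->fn, &out->addr) == 4;
}
```

The function number is the interesting field: a fault from function 0 at address 0 is the case described above, while faults from function 1 point to the requester ID problem.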
Comment 53 Joshua Scoggins 2014-05-03 15:53:26 UTC
(In reply to Alex Williamson from comment #52)
> (In reply to Joshua Scoggins from comment #51)
> > Created attachment 134871 [details]
> > More Verbose DMESG output for tycho (z9pe-d8 WS, x79 + C600)
> > 
> > Sorry about that, I thought I saw the DMAR error in those logs.....
> > I enabled heavy debug mode this time and confirmed I saw the DMAR errors
> and
> > have attached that log.
> 
> Your log doesn't seem to match the issues others are having.  There's only a
> single DMAR fault:
> 
> [    1.887994] dmar: DRHD: handling fault status reg 2
> [    1.888349] dmar: DMAR:[DMA Read] Request device [0b:00.0] fault addr 0 
> DMAR:[fault reason 06] PTE Read access is not set
> 
> This is a read access to physical address 0x0 from function 0.  It seems
> valid for the IOMMU to block this, the driver can't possibly have mapped a
> buffer at 0x0.  Later the log shows problems probing these channels:
> 
> [    7.193298] ata8.00: qc timeout (cmd 0xec)
> [    7.194320] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [    7.197283] ata14.00: qc timeout (cmd 0xa1)
> [    7.198266] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [    7.237279] ata7: link is slow to respond, please be patient (ready=0)
> [    7.238300] ata9: link is slow to respond, please be patient (ready=0)
> [    7.513247] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [    7.517218] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> [   11.884142] ata9: COMRESET failed (errno=-16)
> [   11.885130] ata7: COMRESET failed (errno=-16)
> [   17.242784] ata9: link is slow to respond, please be patient (ready=0)
> [   17.243720] ata7: link is slow to respond, please be patient (ready=0)
> [   17.510746] ata8.00: qc timeout (cmd 0xec)
> [   17.511561] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [   17.512361] ata8: limiting SATA link speed to 1.5 Gbps
> [   17.514750] ata14.00: qc timeout (cmd 0xa1)
> [   17.515498] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [   17.516231] ata14: limiting SATA link speed to 1.5 Gbps
> [   17.830691] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
> [   17.834674] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [   21.889672] ata7: COMRESET failed (errno=-16)
> [   21.890480] ata9: COMRESET failed (errno=-16)
> [   27.248351] ata9: link is slow to respond, please be patient (ready=0)
> [   27.249142] ata7: link is slow to respond, please be patient (ready=0)
> [   47.823276] ata8.00: qc timeout (cmd 0xec)
> [   47.823988] ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [   47.827246] ata14.00: qc timeout (cmd 0xa1)
> [   47.827960] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [   48.143219] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
> [   48.147218] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [   56.921034] ata7: COMRESET failed (errno=-16)
> [   56.921773] ata7: limiting SATA link speed to 1.5 Gbps
> [   56.922548] ata9: COMRESET failed (errno=-16)
> [   56.923265] ata9: limiting SATA link speed to 1.5 Gbps
> [   61.943795] ata9: COMRESET failed (errno=-16)
> [   61.944518] ata9: reset failed, giving up
> [   61.945297] ata7: COMRESET failed (errno=-16)
> [   61.946022] ata7: reset failed, giving up
> 
> But, there are no further DMAR faults.  Can you confirm whether adding the
> 0x9230 ID for your device changes anything?  If the device is making a stray
> access to address 0x0, it may simply be incompatible with an IOMMU, in which
> case it should probably be assigned to a passthrough domain using the
> iommu=pt option.

By adding 0x9230 ID for my device do you mean to quirks.c as others have done? If so then I already did that with no change. 

Adding the iommu=pt option to the kernel command line does fix the dmar error but is that all that is necessary? When this option is added do all devices get passed through or just those incompatible with the IOMMU? The kernel command line options documentation is sparse on what this does.

And if adding iommu=pt is all I need to do then I appreciate the work (and I apologize if I sent you on a wild goose chase trying to fix this) as my system boots much faster and feels generally more responsive.
Comment 54 Alex Williamson 2014-05-03 16:43:12 UTC
(In reply to Joshua Scoggins from comment #53)
> (In reply to Alex Williamson from comment #52)
> > (In reply to Joshua Scoggins from comment #51)
> > > Created attachment 134871 [details]
> > > More Verbose DMESG output for tycho (z9pe-d8 WS, x79 + C600)
> > > 
> > > Sorry about that, I thought I saw the DMAR error in those logs.....
> > > I enabled heavy debug mode this time and confirmed I saw the DMAR errors
> and
> > > have attached that log.
> > 
> > Your log doesn't seem to match the issues others are having.  There's only
> a
> > single DMAR fault:
> > 
> > [    1.887994] dmar: DRHD: handling fault status reg 2
> > [    1.888349] dmar: DMAR:[DMA Read] Request device [0b:00.0] fault addr 0 
> > DMAR:[fault reason 06] PTE Read access is not set
> > 
> > This is a read access to physical address 0x0 from function 0.  It seems
> > valid for the IOMMU to block this, the driver can't possibly have mapped a
> > buffer at 0x0.  Later the log shows problems probing these channels:

Correction, the device attempted to access I/O virtual address 0x0, not physical address 0x0, but I think the conclusion is the same, the hardware is doing a stray DMA access.

[...]
> By adding 0x9230 ID for my device do you mean to quirks.c as others have
> done? If so then I already did that with no change. 

Yes, if adding 0x9230 made no change then I don't think this device suffers from the same problem as the other Marvell controllers reported here.  The expected failure mode is that during channel probing the device generates several DMAR faults to non-zero addresses where the requests are being generated using function 1 rather than the correct requester ID.  When quirked, the IOMMU maps both function 0 and function 1 through the IOMMU, allowing these accesses and the device works.

Your report in comment 44 indicates that the DMAR error from your device was always from the correct function, just to an unmapped address.  This suggests the hardware might be using a DMA read as a way to flush previous transactions, effectively a bus synchronization.
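
To make the function 0 versus function 1 distinction concrete, here is a small sketch of how a PCIe requester ID is laid out (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and how a function 1 alias of a device can be derived. The helper names are hypothetical and this is a user-space illustration, not the kernel's quirk code.

```c
#include <stdint.h>

/* Pack bus/device/function into a 16-bit requester ID. */
static uint16_t requester_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x07));
}

/* Same bus and device with the function forced to 1: the extra ID that
 * would also need to be mapped for the affected controllers. */
static uint16_t func1_alias(uint16_t rid)
{
    return (uint16_t)((rid & ~0x07u) | 1u);
}
```

For the device at 0b:00.0 this gives requester ID 0x0b00 and alias 0x0b01, which corresponds to the [0b:00.1] requester seen in faults from the affected controllers.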

> Adding the iommu=pt option to the kernel command line does fix the dmar
> error but is that all that is necessary? When this option is added do all
> devices get passed through or just those incompatible with the IOMMU? The
> kernel command line options documentation is sparse on what this does.
> 
> And if adding iommu=pt is all I need to do then I appreciate the work (and I
> apologize if I sent you on a wild goose chase trying to fix this) as my
> system boots much faster and feels generally more responsive.

The passthrough option is probably intentionally vague because it depends a little on how the IOMMU driver interprets it.  On VT-d, an IOMMU domain is created that identity maps all memory.  With the exception of devices that can only do 32bit DMA, all devices will be attached to this domain.  This means you lose the isolation capabilities of the IOMMU for most of your host devices, but you can still use the IOMMU for device assignment.

Another possible solution to this problem that would maintain the most usefulness of the IOMMU would be for the driver to map a scratch DMA page at this address for the hardware to access.  We could also create a quirk for the devices that only maps this device to the identity mapping domain.  In any case, it should probably be handled in a separate bug.
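
The difference between these options can be pictured with a toy model. The sketch below is not how the VT-d driver is implemented; it is a deliberately simplified user-space model, with hypothetical names (toy_domain, toy_map, toy_translate), of a domain that either translates through an explicit mapping table or identity maps everything.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_MAX_MAPPINGS 8

struct toy_domain {
    bool identity;   /* models the identity-mapped (iommu=pt style) domain */
    size_t count;
    struct { uint64_t iova_pfn, phys_pfn; } map[TOY_MAX_MAPPINGS];
};

/* Map one page of I/O virtual address space to a physical page. */
static bool toy_map(struct toy_domain *d, uint64_t iova, uint64_t phys)
{
    if (d->count >= TOY_MAX_MAPPINGS)
        return false;
    d->map[d->count].iova_pfn = iova >> TOY_PAGE_SHIFT;
    d->map[d->count].phys_pfn = phys >> TOY_PAGE_SHIFT;
    d->count++;
    return true;
}

/* Returns true and stores the translated address, or false to model a fault. */
static bool toy_translate(const struct toy_domain *d, uint64_t iova, uint64_t *phys)
{
    const uint64_t offset_mask = ((uint64_t)1 << TOY_PAGE_SHIFT) - 1;

    if (d->identity) {
        *phys = iova;
        return true;
    }
    for (size_t i = 0; i < d->count; i++) {
        if (d->map[i].iova_pfn == (iova >> TOY_PAGE_SHIFT)) {
            *phys = (d->map[i].phys_pfn << TOY_PAGE_SHIFT) | (iova & offset_mask);
            return true;
        }
    }
    return false;   /* unmapped address: the access is blocked */
}
```

In this model a read of address 0x0 faults in an empty translated domain, succeeds once a scratch page is mapped at 0x0, and always succeeds in the identity domain, at the cost of isolation.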
Comment 55 Joshua Scoggins 2014-05-03 17:40:19 UTC
Created attachment 134951 [details]
Comparison of kernel 3.10.25 and 3.15.0 w/ iommu=pt without quirks entries

Well I went back to 3.10.25 thinking that the solution was to put iommu=pt and 
everything should be okay but doing that with 3.10.25 causes quite a large 
number of DMAR errors to show up during boot which I have a dmesg output log
of. 

This got me thinking that perhaps your patches actually fix the issue so I
commented out my quirk entry, recompiled, rebooted, and got similar (if not
identical) DMAR errors (log attached as well). So it seems that:

1) My marvell 9230 controller is not compatible with the IOMMU and iommu=pt
   needs to be added to the kernel command line
2) Your patches do solve the issue after I add an entry for my controller to
   drivers/pci/quirks.c
Comment 56 Alex Williamson 2014-05-03 18:24:26 UTC
Ok, I'll keep the quirk for 0x9230, and then there are issues beyond that that we need to tackle since this device is extra broken.  Thanks
Comment 57 Tobias N 2014-05-03 19:34:56 UTC
Created attachment 134991 [details]
without and witch patch on Gigabyte GA-X79-UP4 Marvell 88SE9172

Update of Comment 47
@Alex Williamson: Your Kernel from git is working on Gigabyte GA-X79-UP4 Rev1.1 BIOS F7 in Debian Testing Jessie.

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index ea55b0f..f0d8b11 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3366,6 +3366,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9123,
 /* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c14 */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9130,
                         quirk_dma_func1_alias);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9172,
+                        quirk_dma_func1_alias);
 /* https://bugs.gentoo.org/show_bug.cgi?id=497630 */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_JMICRON,
                         PCI_DEVICE_ID_JMICRON_JMB388_ESD,

The hard disk is accessible with VT-d active, and it's possible to use vfio passthrough with x-vga in qemu 1.7.

Attached are the dmesg and lspci logs without and with the patch. If you need more information, just send me a message.

Thank you for your work.
Comment 58 Joshua Scoggins 2014-05-03 21:06:54 UTC
(In reply to Alex Williamson from comment #56)
> Ok, I'll keep the quirk for 0x9230, and then there are issues beyond that
> that we need to tackle since this device is extra broken.  Thanks

This device is very much extra broken. If the disks are under heavy load from something like an MDADM raid recovery, then running smartctl on the last disk on the controller will trigger a SMART command failure or an IDENTIFY DEVICE command failure, which will cause the drive's link to be reset. Fortunately, I have to poke it to get it to do it. Here is the output from dmesg:

[Sat May  3 13:40:38 2014] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Sat May  3 13:40:38 2014] ata8.00: failed command: IDENTIFY DEVICE
[Sat May  3 13:40:38 2014] ata8.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 24 pio 512 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sat May  3 13:40:38 2014] ata8.00: status: { DRDY }
[Sat May  3 13:40:38 2014] ata8: hard resetting link
[Sat May  3 13:40:39 2014] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[Sat May  3 13:40:39 2014] ata8.00: configured for UDMA/133
[Sat May  3 13:40:39 2014] ata8: EH complete
[Sat May  3 13:41:13 2014] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Sat May  3 13:41:13 2014] ata8.00: failed command: IDENTIFY DEVICE
[Sat May  3 13:41:13 2014] ata8.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 20 pio 512 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sat May  3 13:41:13 2014] ata8.00: status: { DRDY }
[Sat May  3 13:41:13 2014] ata8: hard resetting link
[Sat May  3 13:41:14 2014] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[Sat May  3 13:41:14 2014] ata8.00: configured for UDMA/133
[Sat May  3 13:41:14 2014] ata8: EH complete
[Sat May  3 13:42:18 2014] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Sat May  3 13:42:18 2014] ata8.00: failed command: SMART
[Sat May  3 13:42:18 2014] ata8.00: cmd b0/d1:01:01:4f:c2/00:00:00:00:00/00 tag 19 pio 512 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sat May  3 13:42:18 2014] ata8.00: status: { DRDY }
[Sat May  3 13:42:18 2014] ata8: hard resetting link
[Sat May  3 13:42:19 2014] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[Sat May  3 13:42:19 2014] ata8.00: configured for UDMA/133
[Sat May  3 13:42:19 2014] ata8: EH complete

If you need more information then send me a message. As for system stability, I'm going to disable the IOMMU for the time being.

Once again, I really appreciate all of the work.
Comment 59 Alex Williamson 2014-05-08 19:41:15 UTC
Found another one right under my nose:

04:00.0 IDE interface [0101]: Marvell Technology Group Ltd. 88SE9172 SATA III 6Gb/s RAID Controller [1b4b:917a] (rev 11)

This is one where function 1 doesn't exist, causing some trouble on AMD-Vi.  Will be fixed in v2 of the patches.
Comment 60 lt-83 2014-05-09 03:47:34 UTC
I stumbled upon this while attempting to install ESXi 5.5 Update 1 on an older Gigabyte GA-X58A-UD3R (rev. 2.0) motherboard (BIOS Fh1) with Vt-d enabled that has a Marvell 9128 controller with 2 ports. It seems the solution would be to switch controllers and deal with SATA2 speeds, as I intend to try VT-d pass through.

If I do re-purpose this machine towards a Linux install, I will take a look at the patch, however. Here's a picture of the vmkernel log (Alt+F12 when installing): http://i.imgur.com/lwNHEuH.jpg
Comment 61 daxcore 2014-05-20 08:21:07 UTC
Patch is working for me (ASUS P7P55D-E PRO). Thanks!

Will this patch be released in an upcoming kernel tag, and if so, in which version?
Comment 62 yourpadremb 2014-05-24 02:23:20 UTC
Using 3.15-rc6 on my Gigabyte z77x-ud5h, I still have the problem

http://pastebin.com/uNcaJ9d2

The current patch does not work anymore
Comment 63 Alex Williamson 2014-05-24 02:32:37 UTC
(In reply to yourpadremb from comment #62)
> Using 3.15-rc6 in my Gigabyte z77x-ud5h still have the problem
> 
> http://pastebin.com/uNcaJ9d2
> 
> The current patch does not work anymore

Which current patch doesn't work anymore?

Does this work for you? https://lkml.org/lkml/2014/5/22/685
Comment 64 Andrew Cooks 2014-05-24 02:53:58 UTC
The 'current' patch set is the one posted by Alex to lkml, not the one attached to this bug report. Alex's patch set really is the way forward (even if it's missing some device IDs) and I'm hopeful that it will be picked up by mainline soon. 
 
The patch I attached here will not apply to 3.15, because of other changes in the intel iommu driver. It's easy to fix, but you really should be using Alex's patch set instead.

I saw the pastebin log title is: "linux 3.15-rc6 and Marvell SATA 88SE9128". Please be more specific about the device id (9128) in your bugzilla comments in future.

Alex: the 9128 device ID is missing in v4 of your patch set.
Comment 65 yourpadremb 2014-05-24 03:08:35 UTC
Sorry, I made a mistake. I confused the name of my Marvell card with the one discussed here

lspci -nn |grep SATA
00:1f.2 SATA controller [0106]: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] [8086:1e02] (rev 04)
03:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 11)
08:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 11)
Comment 66 Alex Williamson 2014-05-24 03:13:21 UTC
(In reply to Andrew Cooks from comment #64)
> The 'current' patch set is the one posted by Alex to lkml, not the one
> attached to this bug report. Alex's patch set really is the way forward
> (even if it's missing some device IDs) and I'm hopeful that it will be
> picked up by mainline soon. 
>  
> The patch I attached here will not apply to 3.15, because of other changes
> in the intel iommu driver. It's easy to fix, but you really should be using
> Alex's patch set instead.
> 
> I saw the pastebin log title is: "linux 3.15-rc6 and Marvell SATA 88SE9128".
> Please be more specific about the device id (9128) in your bugzilla comments
> in future.
> 
> Alex: the 9128 device ID is missing in v4 of your patch set.

Can you reference a bug comment that identifies this device ID?  I can pick it up if there's another rev of the series, otherwise I'd prefer to add it as a follow-on patch so we don't overload upstream.
Comment 67 *cJ* 2014-05-30 06:50:34 UTC
Here with a HighPoint RocketRaid 642L which has the 88SE9235 chip and can be supported by the ahci driver (patch submitted).

Alex, I tried to rebase your dma-alias branch on top of Linus' master, but I can't boot with the quirk when the card is in the computer (I've added an entry for the VID/PID for this card 0x1103/0x0642).
Using AMD 990FX (ASUS Sabertooth 990FX, rev1 I think) and some IOMMU-related kernel command-line parameters: iommu=1 ivrs_ioapic[9]=00:14.0 ivrs_ioapic[10]=00:00.1 which are apparently needed because of BIOS borkedness.

Side note 1: Just for fun I went through HighPoint's support since they provide a driver, and even a web GUI which needs to be used to configure the extensive features of the board... which I have never used and never plan to use.
"Dear customer, The driver doesn't support IOMMU. You need to disable it from system BIOS."

Side note 2: With IOMMU disabled, I observed the same kind of issues as Joshua, but not with every device plugged on it though. But that's out of scope for this bug.
Comment 68 Alex Williamson 2014-05-30 14:17:15 UTC
(In reply to *cJ* from comment #67)
> Here with a HighPoint RocketRaid 642L which has the 88SE9235 chip and can be
> supported by the ahci driver (patch submitted).
> 
> Alex, I tried to rebase your dma-alias branch on top of Linus' master, but I
> can't boot with the quirk when the card is in the computer (I've added an
> entry for the VID/PID for this card 0x1103/0x0642).
> Using AMD 990FX (ASUS Sabertooth 990FX, rev1 I think) and some IOMMU-related
> kernel command-line parameters: iommu=1 ivrs_ioapic[9]=00:14.0
> ivrs_ioapic[10]=00:00.1 which are apparently needed because of BIOS
> borkedness.

What errors do you get on stock upstream?  How does the problem change with the dma-alias-v4 tag (without additional VID/DID)?  With additional VID/DID?
Comment 69 *cJ* 2014-05-30 15:01:07 UTC
Can boot with the card on master linux, and get the same AMD-Vi IOMMU PAGE_FAULT errors with the HighPoint official driver (once it's patched to compile...).

I'll check later today with dma-alias-v4.
Comment 70 *cJ* 2014-06-03 18:41:43 UTC
(sorry for the delay) dma-alias-v4 is awesome (unlike this hardware)!
I see no more IOMMU issues with the quirk.

Tested-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Comment 71 Rich.T. 2014-07-28 16:59:00 UTC
Created attachment 144451 [details]
Trouble passing through my Highpoint RocketRAID 640L PCIE Storage controller to a domU in my XenCenter home server.
Comment 72 Rich.T. 2014-07-28 17:12:08 UTC
Hi,

I am having trouble passing through my Highpoint RocketRAID 640L PCIE Storage controller to a domU in my XenCenter home server.

http://www.highpoint-tech.com/USA_new/series_rr600-overview.htm

It works fine on bare metal OSes, but seems to give behavior suspiciously similar to this bug, when I attempt to pass it through to a XenServer domU:

https://bugzilla.kernel.org/show_bug.cgi?id=42679

Server hardware: Asus Sabertooth 990FX motherboard with an AMD FX8350 8 core processor and 32GB RAM.

https://bugzilla.kernel.org/attachment.cgi?id=144451

Could I fix this with the "dma-alias-v4" patch?
This does not seem to exist, as far as I can tell: git://github.com/awilliam/linux-vfio.git dma-alias-v4

Has this patch been committed to the kernel, and is the reason I'm having problems that XenServer uses 3.10?

Any help would be appreciated.
Comment 73 *cJ* 2014-07-28 17:24:52 UTC
Rich, I added the PCI ids for the 642L but not 640L, in the regular AHCI driver, and for the DMA quirk. You'd need to add the PCI IDs for the 640L (see commits c2e0fb966ad8ab3c41eb6fbf0239d06f287a2505, d251836508fb26cd1a22b41381739835ee23728d). If you have comments, don't hesitate!

Yes I think the issue would be fixed with the patch (and the added PCI ids).
Note that the patches are in mainline 3.15 now, maybe it's an option for you to upgrade.


BTW, should this bug be marked as resolved?
Comment 74 Rich.T. 2014-07-28 17:41:49 UTC
Hi.

Don't mark as resolved just yet; this may be a deeper problem in XenServer:
As I said, the attached storage shows up and works just fine in the XenServer dom0 (3.10 kernel), but does not pass through to an Ubuntu Server domU (3.13 kernel). This seems weird to me.

I'll try to upgrade to the 3.15 kernel in the domU (ownCloud VM) and see what happens, but XenServer is under heavy, active development ATM (though still based on the *old* CentOS5 build) and is giving me severe problems doing *anything* on its dom0. Also, this card is unsupported on XenServer, so I don't anticipate any help from them to this end.

If a kernel upgrade to 3.15 (on the domU in question) solves this problem, then it may be possible to put this one to bed.

Thanks for the *very* prompt reply,

Rich.
Comment 75 Alex Williamson 2014-07-28 17:52:59 UTC
This bug is not yet resolved, the full set of IOMMU changes won't be upstream until 3.17.  Unfortunately Rich, XenServer is an entirely different beast and needs a bug filed somewhere else.  The kernel changes here are not going to magically fix the Xen hypervisor.  Xen folks may choose a similar solution, but that's for them to decide.
Comment 76 Rich.T. 2014-07-28 18:27:03 UTC
IMHO, Citrix / XenServer developers will not waste their time with this one; they have enough on their plate just getting XS up to date and adding long overdue features and improvements, without being asked to deal with *unsupported* hardware etc.

Xen devs, upstream, might be more receptive, but they're having enough trouble ATM in just getting their basic wiki articles fit for use (I'm right now looking to see if I can help them in a big documentation drive this week).

I'll Google around this issue again and see what I can see, from the Xen perspective.

Thanks for your responses,

Rich.
Comment 77 Jerry Chen 2014-08-02 23:11:06 UTC
I've compiled a RPM for those who are using CentOS 6.5.

http://jerry.sh/kernel/kernel-3.15.0_rc4+-1.x86_64.rpm
http://jerry.sh/kernel/kernel-devel-3.15.0_rc4+-1.x86_64.rpm
http://jerry.sh/kernel/kernel-headers-3.15.0_rc4+-1.x86_64.rpm

Compiled with Alex Williamson's dma-alias-v4 source.
Comment 78 CJ 2014-08-08 22:52:38 UTC
I don't know if the DMAR errors I am receiving have anything to do with this thread.  But I have been researching this problem & keeping abreast of the comments here, & just wanted to offer more insight & any support required.

These DMAR errors are occurring on my ASUS X79 Rampage IV Black Edition.  It happens for the same reasons stated here (intel_iommu is set) but with a slightly different configuration.  It happens with my Plextor SSD running on a Sonnet PCIe Adapter Card:


[    1.504803] dmar: DRHD: handling fault status reg 2
[    1.504806] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[    1.844208] dmar: DRHD: handling fault status reg 102
[    1.844488] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[    1.997325] dmar: DRHD: handling fault status reg 202
[    1.997327] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[    7.003663] dmar: DRHD: handling fault status reg 302
[    7.003949] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[    7.322245] dmar: DRHD: handling fault status reg 402
[    7.322515] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[    7.475646] dmar: DRHD: handling fault status reg 502
[    7.475917] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[   12.481676] dmar: DRHD: handling fault status reg 602
[   12.481962] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[   12.800264] dmar: DRHD: handling fault status reg 702
[   12.800535] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[   12.953665] dmar: DRHD: handling fault status reg 2
[   12.953935] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[   17.959657] dmar: DRHD: handling fault status reg 102
[   17.959944] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear
[   18.278287] dmar: DRHD: handling fault status reg 202
[   18.278558] dmar: DMAR:[DMA Write] Request device [0b:00.1] fault addr fffe0000 
DMAR:[fault reason 02] Present bit in context entry is clear


There is no 0b:00.1 device.  However, when I query pci devices for "0b", I get the following:

root [ ~ ]# lspci | grep -i 0b
0b:00.0 SATA controller: Marvell Technology Group Ltd. Device 9182 (rev 11)
ff:0b.0 System peripheral: Intel Corporation Xeon E7 v2/Xeon E5 v2/Core i7 UBOX Registers (rev 04)
ff:0b.3 System peripheral: Intel Corporation Xeon E7 v2/Xeon E5 v2/Core i7 UBOX Registers (rev 04)

I read somewhere that with iommu removed from the config, things should work fine.  And so they did.  No DMAR errors.  But, after reading comments here, I thought I would try to see if changing quirks.c would work for me.  So, I turned on iommu in the config & modified quirks.c:

DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9182,
                         quirk_dma_func1_alias);

and got the same DMAR errors I had been getting before I had any understanding that intel_iommu along with certain hw could cause some serious issues.


This site, & in particular, these postings have been a TREMENDOUS help to me.  Thanks EVERYBODY!!!
Comment 79 CJ 2014-08-09 00:05:11 UTC
(In reply to CJ from comment #78)

[Edit]
OK... I am a little confused.  I tried a couple of tests to get a little more insight into this issue:

   1) With my changes to quirks.c & intel_iommu_default set (INTEL_IOMMU_DEFAULT_ON=y), I get the DMAR errors.
   2) I left my changes in quirks.c & changed the kernel config to NOT set intel_iommu as the default (INTEL_IOMMU_DEFAULT_ON=n).  No DMAR errors... great!
   3) I reverted my changes to quirks.c to the original & left config as is.  No DMAR errors.

This also means I can see my PCI-SSD device in Linux (i.e., in gparted).  But when I checked the .config file itself, INTEL_IOMMU_DEFAULT_ON is turned on, which I can't explain.  Everything appears to be working.  I can only keep an eye on any impacts moving forward.  Thanks.
Comment 80 Alex Williamson 2014-08-09 00:43:24 UTC
CJ, what kernel are you adding your quirk to?  If it's not a kernel patched with the v4 patch set or the current pre-3.17-rc1 tree, then the IOMMU code is not yet in place to make use of the quirk.
Comment 81 CJ 2014-08-09 01:25:23 UTC
(In reply to Alex Williamson from comment #80)
> CJ, what kernel are you adding your quirk to?  If it's not a kernel patched
> with the v4 patch set or the current pre-3.17-rc1 tree, then the IOMMU code
> is not yet in place to make use of the quirk.

Hi Alex,

I am using v3.16.
Comment 82 Alex Williamson 2014-08-09 01:53:49 UTC
Then the quirk doesn't actually make any difference yet.  Use the current development tree or wait for 3.17-rc1.
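
As a rough illustration of that cutoff, the sketch below checks a kernel release string against 3.17. It assumes an unpatched mainline kernel; a 3.15 or 3.16 tree carrying the v4 patch set would be reported as "not effective" even though the quirk works there. The function name `quirk_takes_effect` is hypothetical.

```python
import re


def quirk_takes_effect(release, first_fixed=(3, 17)):
    """Return True if `release` (as printed by `uname -r`) is at least first_fixed.

    Only the leading major.minor numbers are compared; suffixes such as
    "-rc1" or distribution tags are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognised kernel release: %r" % release)
    version = (int(match.group(1)), int(match.group(2)))
    return version >= first_fixed


for release in ("3.16.0", "3.17.0-rc1", "4.0.5"):
    print(release, quirk_takes_effect(release))
```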
Comment 83 CJ 2014-08-09 02:02:05 UTC
(In reply to Alex Williamson from comment #82)
> Then the quirk doesn't actually make any difference yet.  Use the current
> development tree or wait for 3.17-rc1.

Got it!  Thank you!  Much appreciation!!
Comment 84 Jerry Chen 2014-08-09 23:44:18 UTC
This is weird.

/var/log/messages is flooded with:
Aug  9 16:42:26 Hypervisor kernel: dmar: DRHD: handling fault status reg 3
Aug  9 16:42:26 Hypervisor kernel: dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 100000000
Aug  9 16:42:26 Hypervisor kernel: DMAR:[fault reason 06] PTE Read access is not set

02:00.0 is a PCI (not PCIe) Intel NIC

# lspci
02:00.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)

# lspci -n
02:00.0 0200: 8086:107c (rev 05)

Any chances that this is related?
Comment 85 CJ 2014-08-13 22:19:58 UTC
(In reply to Jerry Chen from comment #84)
> This is weird.
> 
> /var/log/messages is flooded with:
> Aug  9 16:42:26 Hypervisor kernel: dmar: DRHD: handling fault status reg 3
> Aug  9 16:42:26 Hypervisor kernel: dmar: DMAR:[DMA Read] Request device
> [02:00.0] fault addr 100000000
> Aug  9 16:42:26 Hypervisor kernel: DMAR:[fault reason 06] PTE Read access is
> not set
> 
> 02:00.0 is a PCI (not PCIe) Intel NIC
> 
> # lspci
> 02:00.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet
> Controller (rev 05)
> 
> # lspci -n
> 02:00.0 0200: 8086:107c (rev 05)
> 
> Any chances that this is related?

Well, something happened between v3.12.26 & v3.16.  Referring to "Comment 79" where these errors were produced on v3.16 with intel_iommu set as default (INTEL_IOMMU_DEFAULT_ON=y), they do NOT show up on v3.12.26 with the default set.  I know this will be fixed in 3.17.  But, it looks like v3.12.26 somehow got looked over as it relates to this problem.  This version, IMO, appears to be really solid.
Comment 86 Jerry Chen 2014-08-15 00:54:33 UTC
(In reply to CJ from comment #85)
> Well, something happened between v3.12.26 & v3.16.  Referring to "Comment
> 79" where these errors were produced on v3.16 with intel_iommu set as
> default (INTEL_IOMMU_DEFAULT=y), they do NOT show up on v3.12.26 with the
> default set.  I know this will be fixed in 3.17.  But, it looks like
> v3.12.26 somehow got looked over as it relates to this problem.  This
> version, IMO, appears to be really solid.

You are right, 3.12.26 seems to be the perfect kernel on this issue.
Comment 87 Jerry Chen 2014-08-15 08:09:05 UTC
(In reply to CJ from comment #85)
> Well, something happened between v3.12.26 & v3.16.  Referring to "Comment
> 79" where these errors were produced on v3.16 with intel_iommu set as
> default (INTEL_IOMMU_DEFAULT=y), they do NOT show up on v3.12.26 with the
> default set.  I know this will be fixed in 3.17.  But, it looks like
> v3.12.26 somehow got looked over as it relates to this problem.  This
> version, IMO, appears to be really solid.

Can you share your .config for 3.12.26? It seems like I messed up something.
Comment 88 Andrew Cooks 2014-08-15 09:16:38 UTC
Jerry,

DMAR:[fault reason 06] is not the same as 
DMAR:[fault reason 02] and the device does not use a Marvell chipset.

Therefore it is very unlikely that the problem you are experiencing is related.
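
The distinction can be made from the log text alone. Below is a minimal sketch, assuming the "DMAR:[fault reason NN]" format seen in this thread; the two reason strings are copied from the logs posted here, and the helper names are illustrative.

```python
import re

REASON_RE = re.compile(r"DMAR:\[fault reason (\d+)\]")

# Reason texts exactly as they appear in the logs posted in this bug.
REASON_TEXT = {
    2: "Present bit in context entry is clear",
    6: "PTE Read access is not set",
}


def fault_reasons(log_text):
    """Return the set of numeric DMAR fault reason codes found in the log."""
    return {int(code) for code in REASON_RE.findall(log_text)}


def looks_like_this_bug(log_text):
    """True if the log contains reason 02, the failure mode tracked here."""
    return 2 in fault_reasons(log_text)


MARVELL_LOG = "DMAR:[fault reason 02] Present bit in context entry is clear"
NIC_LOG = "Hypervisor kernel: DMAR:[fault reason 06] PTE Read access is not set"

print(looks_like_this_bug(MARVELL_LOG), looks_like_this_bug(NIC_LOG))
```

Reason 02 alone does not prove a requester ID problem, so the affected device should still be checked as well.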

Please file a separate bug report using the instructions in the 'REPORTING-BUGS' file in your kernel source tree.
Comment 89 Kayvan Javid 2014-08-23 11:52:17 UTC
Marvell 88SE9230 chip here, still problems using the Ubuntu 3.17-rc1 mainline kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.17-rc1-utopic/
Comment 90 Alex Williamson 2014-08-23 13:46:41 UTC
(In reply to Kayvan Javid from comment #89)
> Marvell 88SE9230 chip here, still problems using the Ubuntu 3.17-rc1 mainline kernel:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.17-rc1-utopic/

Do we have your device ID included?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/quirks.c#n3458
Comment 91 Kayvan Javid 2014-08-25 14:28:17 UTC
lspci -nn
01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 11)

It is in the quirks.c cgit link line 3479:
/* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c49 */
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9230,
			 quirk_dma_func1_alias);

Still getting output as per comment 58.
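
The lookup done by hand above (read the [vendor:device] pair from `lspci -nn`, then search quirks.c for it) can be sketched as follows. The `QUIRKED_IDS` set is an assumption for illustration and holds only the one entry quoted in this comment, not the full list in quirks.c; 0x1b4b is the vendor ID that lspci prints for this Marvell device.

```python
import re

# The class code tag such as [0106] has no colon, so it is skipped.
PCI_ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")

# Illustrative subset only: the single entry quoted in this comment.
QUIRKED_IDS = {(0x1B4B, 0x9230)}


def pci_id_from_lspci_nn(line):
    """Return (vendor, device) from the first [vvvv:dddd] tag, or None."""
    match = PCI_ID_RE.search(line)
    if not match:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


LINE = ("01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. "
        "88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 11)")

pci_id = pci_id_from_lspci_nn(LINE)
print(pci_id in QUIRKED_IDS)
```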
Comment 92 Alex Williamson 2014-08-25 14:35:16 UTC
(In reply to Kayvan Javid from comment #91)
> lspci -nn
> 01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe
> SATA 6Gb/s Controller [1b4b:9230] (rev 11)
> 
> It is in the quirks.c cgit link line 3479:
> /* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c49 */
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9230,
>                        quirk_dma_func1_alias);
> 
> Still getting output as per comment 58.

Comment 58 isn't even a DMA fault.  Please file a new bug; we can't track every broken aspect of Marvell controllers in a single bug.
Comment 93 Kayvan Javid 2014-08-25 14:51:59 UTC
You are quite right, I am *not* seeing any DMA problems with the Marvell 9230.

I misread the bug, and a lot of the dmesg output people have posted includes the same ATA errors I am seeing.
Comment 94 Elliott 2014-10-13 18:06:44 UTC
Add the Marvell 9183 to the list:

with intel_iommu=on:

dmesg |grep dmar
[    0.026475] dmar: Host address width 39
[    0.026476] dmar: DRHD base: 0x000000fed90000 flags: 0x0
[    0.026482] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c0000020660462 ecap f0101a
[    0.026483] dmar: DRHD base: 0x000000fed91000 flags: 0x1
[    0.026487] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap d2008c20660462 ecap f010da
[    0.026487] dmar: RMRR base: 0x000000d8ce3000 end: 0x000000d8ceffff
[    0.026488] dmar: RMRR base: 0x000000db000000 end: 0x000000df1fffff
[    0.719857] dmar: DRHD: handling fault status reg 2
[    0.719878] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
[    1.034557] dmar: DRHD: handling fault status reg 3
[    1.034576] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
[    6.042308] dmar: DRHD: handling fault status reg 2
[    6.042339] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000
[    6.350034] dmar: DRHD: handling fault status reg 3
[    6.350053] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000
[   11.357836] dmar: DRHD: handling fault status reg 2
[   11.357864] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000
[   11.665512] dmar: DRHD: handling fault status reg 3
[   11.665532] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000
[   16.673282] dmar: DRHD: handling fault status reg 2
[   16.673311] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000

lspci -nnv
02:00.0 SATA controller [0106]: Device [1c28:0122] (rev 14) (prog-if 01 [AHCI 1.0])
 Subsystem: Marvell Technology Group Ltd. Device [1b4b:9183]

Note, for google: this is the controller embedded in the Plextor m6e M.2 SSD
Comment 95 Alex Williamson 2014-10-21 19:25:20 UTC
(In reply to Elliott from comment #94)
> Add the Marvell 9183 to the list:
> 
> with intel_iommu=on:
> 
> dmesg |grep dmar
...
> [    0.719878] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr
> fffe0000 
> [    1.034557] dmar: DRHD: handling fault status reg 3
...
> 
> lspci -nnv
> 02:00.0 SATA controller [0106]: Device [1c28:0122] (rev 14) (prog-if 01
> [AHCI 1.0])
>  Subsystem: Marvell Technology Group Ltd. Device [1b4b:9183]
> 
> Note, for google: this is the controller embedded in the Plextor m6e M.2 SSD

FWIW, the PCI vendor ID is actually Lite-on.  Can you confirm this patch resolves the problem:

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3484,6 +3484,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_TTI, 0x0642,
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_JMICRON,
                         PCI_DEVICE_ID_JMICRON_JMB388_ESD,
                         quirk_dma_func1_alias);
+/* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c94 */
+DECLARE_PCI_FIXUP_HEADER(0x1c28, 0x0122, quirk_dma_func1_alias);
 
 /*
  * A few PCIe-to-PCI bridges fail to expose a PCIe capability, resulting in
Comment 96 Elliott 2014-10-22 19:16:59 UTC
(In reply to Alex Williamson from comment #95)
No luck. I tried both the patch you posted, and I also tried:

DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9183, quirk_dma_func1_alias);

Just for good measure. In both cases I get the same errors as posted in c94 when intel_iommu=on.

Let me know if there's anything else I can do to try and help debug.
Thanks,
Elliott
Comment 97 Alex Williamson 2014-10-22 19:29:49 UTC
(In reply to Elliott from comment #96)
> (In reply to Alex Williamson from comment #95)
> No luck. I tried both the patch you posted, and I also tried:

Do any of these kernel boot options make a difference:

pci=nomsi

intremap=off

intremap=nosid

Given your fault address, I expect they might all work.  The last option is the least invasive.
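
When comparing runs with these options, it helps to record which of them the running kernel was actually booted with. A minimal sketch, assuming the command line is available as a string (on a live system it can be read from /proc/cmdline); the sample command line below is hypothetical.

```python
def iommu_boot_options(cmdline):
    """Return the intel_iommu, intremap and pci settings found in a kernel command line."""
    wanted = ("intel_iommu", "intremap", "pci")
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            found[key] = value
    return found


# Hypothetical command line used only for illustration.
SAMPLE = "BOOT_IMAGE=/vmlinuz root=/dev/sda4 ro intel_iommu=on intremap=nosid"
print(iommu_boot_options(SAMPLE))
```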
Comment 98 Elliott 2014-10-25 11:47:12 UTC
Created attachment 154931 [details]
dmesg intremap=nosid
Comment 99 Elliott 2014-10-25 11:47:38 UTC
Created attachment 154941 [details]
dmesg intremap=off
Comment 100 Elliott 2014-10-25 11:48:00 UTC
Created attachment 154951 [details]
dmesg intel_iommu=off
Comment 101 Elliott 2014-10-25 11:48:18 UTC
Created attachment 154961 [details]
dmesg intel_iommu=on
Comment 102 Elliott 2014-10-25 11:48:38 UTC
Created attachment 154971 [details]
dmesg pci=nomsi
Comment 103 Elliott 2014-10-25 11:50:26 UTC
I tried with all of the kernel options you recommended, and also included dmesg with intel_iommu=on and intel_iommu=off for comparison. Regardless of the other kernel options, the SSD is not visible when intel_iommu=on.
Comment 104 Elliott 2014-10-27 15:22:29 UTC
Created attachment 155441 [details]
dmesg linux 3.17

Arch just pushed the 3.17 kernel; the bug is still present. Would you like me to re-run the intremap and nomsi kernel options?
Comment 105 Martin Öhrling 2014-11-04 10:46:27 UTC
Created attachment 156481 [details]
attachment-10507-0.html

I can't see any complaint about the present bit being clear in comment 94. There are likely entries for both function 0 and 1. It seems like you have another problem...

Did you use the controller to boot the kernel? I noticed issues when using the Marvell controller as boot device. My best guess is that the BIOS assigned memory to the controller that it is still accessing. The problem is that the kernel wasn't informed about it. Could your problem be the same?



Comment 106 Martin Öhrling 2014-11-04 10:57:06 UTC
Accidentally created an attachment. Can't seem to find any way to remove it.
Sorry about that... Please feel free to remove it if possible.
Comment 107 Elliott 2014-11-04 19:11:24 UTC
My kernel lives on another disk drive. /dev/sda1 is my EFI system partition, /dev/sda2 is the MSR, /dev/sda3 is NTFS Windows 7, /dev/sda4 is my / partition. The Marvell controller is in the SSD at /dev/sdb. I don't know what you mean by "present bit" (sorry, I'm not so fluent in C).

I'm using the SSD with an embedded Marvell controller as a caching device (enhanceio when I posted to this bug, but I just switched to bcache) for a slower hard drive. I did briefly consider enhanceio might be the problem, so I disabled it completely to test. This didn't make a difference; with intel_iommu=on, the kernel throws the DMAR errors, and I can't access /dev/sdb.
Comment 108 Martin Öhrling 2014-11-04 20:40:01 UTC
The quirk installs entries for both function numbers. If function 1 had been unknown, you would have seen faults about the present bit being clear (see comment 78 for an example). The lack of those messages indicates that you successfully installed entries for both function 0 and 1, and hence that the patch is working.
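
The arithmetic behind "both function numbers" can be sketched as follows. This is a user-space illustration of the idea only, not the kernel's implementation; `pci_devfn` mirrors the slot/function packing used for PCI devfn values, and `requester_ids` is a hypothetical helper.

```python
def pci_devfn(slot, func):
    """Pack a 5-bit slot and a 3-bit function number into one devfn byte."""
    return ((slot & 0x1F) << 3) | (func & 0x07)


def requester_ids(bdf):
    """List the requester IDs that need IOMMU entries for a device at `bdf`.

    Returns the device's own ID plus the function 1 alias in the same slot.
    """
    bus_text, rest = bdf.split(":")
    slot_text, func_text = rest.split(".")
    bus, slot, func = int(bus_text, 16), int(slot_text, 16), int(func_text, 16)
    devfns = sorted({pci_devfn(slot, func), pci_devfn(slot, 1)})
    return ["%02x:%02x.%d" % (bus, devfn >> 3, devfn & 0x07) for devfn in devfns]


print(requester_ids("02:00.0"))
```

For the device in comment 94 this yields 02:00.0 and 02:00.1, the second being the requester ID that shows up in the fault lines.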

You can still run into problems if the chip tries to read/write memory that isn't allocated by the driver module. The problems I saw were related to the controller being initialized and used by the BIOS during boot. It tried to read memory that didn't belong to it (as far as the Linux kernel was concerned). The controller stopped working when the DMA read failed (blocked by the IOMMU).

It is not necessarily an error that the controller is assigned memory during boot, but these memory regions must be reported to the operating system. This is where VT-d support seems to fail on many consumer boards.
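
One way to test this hypothesis against a log is to compare the fault address with the reserved regions the firmware did report. The sketch below assumes those regions are the "RMRR base ... end ..." lines printed in dmesg (as in comment 94); whether a given fault is really caused by an unreported region still needs to be confirmed separately. The helper names are illustrative.

```python
import re

RMRR_RE = re.compile(r"RMRR base: 0x([0-9a-f]+) end: 0x([0-9a-f]+)")


def rmrr_ranges(dmesg_text):
    """Return the (base, end) address pairs of all RMRR lines in the log."""
    return [(int(base, 16), int(end, 16)) for base, end in RMRR_RE.findall(dmesg_text)]


def covered_by_rmrr(addr, ranges):
    """True if addr falls inside any reported reserved region (inclusive)."""
    return any(base <= addr <= end for base, end in ranges)


# RMRR lines copied from the dmesg output in comment 94.
DMESG = """\
[    0.026487] dmar: RMRR base: 0x000000d8ce3000 end: 0x000000d8ceffff
[    0.026488] dmar: RMRR base: 0x000000db000000 end: 0x000000df1fffff
"""

ranges = rmrr_ranges(DMESG)
print(covered_by_rmrr(0xFFFE0000, ranges))
```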
Comment 109 frollic 2015-02-04 18:21:06 UTC
Is there any progress?

I'm hitting this error on Fedora 3.17.8-200.fc20 kernel, which makes my system pretty much unusable :(

07:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 10) (prog-if 01 [AHCI 1.0])
        DeviceName:  Marvell 9230 AHCI controller
        Subsystem: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230]
        Flags: bus master, fast devsel, latency 0, IRQ 92
        I/O ports at b050 [size=8]
        I/O ports at b040 [size=4]
        I/O ports at b030 [size=8]
        I/O ports at b020 [size=4]
        I/O ports at b000 [size=32]
        Memory at 90610000 (32-bit, non-prefetchable) [size=2K]
        Expansion ROM at 90600000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [70] Express Legacy Endpoint, MSI 00
        Capabilities: [e0] SATA HBA v0.0
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: ahci

Motherboard is Supermicro X10SBA - http://www.supermicro.nl/products/motherboard/celeron/X10/X10SBA.cfm
Comment 110 Alex Williamson 2015-02-04 18:36:11 UTC
(In reply to frollic from comment #109)
> Is there any progress ?
> 
> I'm hitting this error on Fedora 3.17.8-200.fc20 kernel, which makes my
> system pretty much unusable :(
> 
> 07:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe
> SATA 6Gb/s Controller [1b4b:9230] (rev 10) (prog-if 01 [AHCI 1.0])

It should have been fixed in v3.16 by cc346a4714 for this device.  Are you sure you're seeing the same error?  What are the symptoms?
Comment 111 Alex Williamson 2015-02-04 18:43:27 UTC
Actually, refreshing my memory in the comments here, others are also reporting that issues for 1b4b:9230 persist, but they're different than the problem we're trying to fix here and suggest either broken hardware or broken driver (or both).  As suggested previously, if you're not getting DMAR faults, file a new bug.
Comment 112 frollic 2015-02-04 19:00:39 UTC
Indeed, I don't have DMAR errors in my syslog.

Drives are 3 * WDC WD20EFRX-68EUZN0, 82.00A82, max UDMA/133 running soft-RAID5.
One SAMSUNG SSD SM841 mSATA 128GB, DXM43D0Q, max UDMA/133 in an mSATA->SATA case/converter.

Feb  4 19:09:43 atlantis kernel: [  464.228813] ata3: failed to read log page 10h (errno=-5)
Feb  4 19:09:43 atlantis kernel: [  464.231988] ata3.00: exception Emask 0x1 SAct 0xc000 SErr 0x0 action 0x0
Feb  4 19:09:43 atlantis kernel: [  464.235233] ata3.00: irq_stat 0x40000008
Feb  4 19:09:43 atlantis kernel: ata3: failed to read log page 10h (errno=-5)
Feb  4 19:09:43 atlantis kernel: ata3.00: exception Emask 0x1 SAct 0xc000 SErr 0x0 action 0x0
Feb  4 19:09:43 atlantis kernel: ata3.00: irq_stat 0x40000008
Feb  4 19:09:43 atlantis kernel: [  464.238596] ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:09:43 atlantis kernel: [  464.242000] ata3.00: cmd 60/00:70:90:3b:bc/04:00:0c:00:00/40 tag 14 ncq 524288 in
Feb  4 19:09:43 atlantis kernel: [  464.242000]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:09:43 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:09:43 atlantis kernel: ata3.00: cmd 60/00:70:90:3b:bc/04:00:0c:00:00/40 tag 14 ncq 524288 in
         res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:09:43 atlantis kernel: [  464.248733] ata3.00: status: { DRDY }
Feb  4 19:09:43 atlantis kernel: ata3.00: status: { DRDY }
Feb  4 19:09:43 atlantis kernel: [  464.252192] ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:09:43 atlantis kernel: [  464.255558] ata3.00: cmd 60/00:78:90:3f:bc/04:00:0c:00:00/40 tag 15 ncq 524288 in
Feb  4 19:09:43 atlantis kernel: [  464.255558]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:09:43 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:09:43 atlantis kernel: ata3.00: cmd 60/00:78:90:3f:bc/04:00:0c:00:00/40 tag 15 ncq 524288 in
         res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:09:43 atlantis kernel: [  464.262523] ata3.00: status: { DRDY }
Feb  4 19:09:43 atlantis kernel: ata3.00: status: { DRDY }
Feb  4 19:09:43 atlantis kernel: [  464.272877] ata3.00: revalidation failed (errno=-2)
Feb  4 19:09:43 atlantis kernel: [  464.276284] ata3: hard resetting link
Feb  4 19:09:43 atlantis kernel: ata3.00: revalidation failed (errno=-2)
Feb  4 19:09:43 atlantis kernel: ata3: hard resetting link
Feb  4 19:09:44 atlantis kernel: [  464.586712] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  4 19:09:44 atlantis kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  4 19:09:44 atlantis kernel: [  464.593370] ata3.00: configured for UDMA/133
Feb  4 19:09:44 atlantis kernel: [  464.596855] ata3: EH complete
Feb  4 19:09:44 atlantis kernel: ata3.00: configured for UDMA/133
Feb  4 19:09:44 atlantis kernel: ata3: EH complete
Feb  4 19:10:03 atlantis kernel: [  484.234979] ata3: failed to read log page 10h (errno=-5)
Feb  4 19:10:03 atlantis kernel: [  484.238484] ata3.00: exception Emask 0x1 SAct 0xc000000 SErr 0x0 action 0x0
Feb  4 19:10:03 atlantis kernel: [  484.242039] ata3.00: irq_stat 0x40000008
Feb  4 19:10:03 atlantis kernel: ata3: failed to read log page 10h (errno=-5)
Feb  4 19:10:03 atlantis kernel: ata3.00: exception Emask 0x1 SAct 0xc000000 SErr 0x0 action 0x0
Feb  4 19:10:03 atlantis kernel: ata3.00: irq_stat 0x40000008
Feb  4 19:10:03 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:10:03 atlantis kernel: [  484.245881] ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:10:03 atlantis kernel: [  484.249500] ata3.00: cmd 60/00:d0:90:c3:ee/04:00:0c:00:00/40 tag 26 ncq 524288 in
Feb  4 19:10:03 atlantis kernel: [  484.249500]          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:10:03 atlantis kernel: ata3.00: cmd 60/00:d0:90:c3:ee/04:00:0c:00:00/40 tag 26 ncq 524288 in
         res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:10:03 atlantis kernel: ata3.00: status: { DRDY }
Feb  4 19:10:03 atlantis kernel: [  484.256948] ata3.00: status: { DRDY }
Feb  4 19:10:03 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:10:03 atlantis kernel: [  484.260818] ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:10:03 atlantis kernel: [  484.264631] ata3.00: cmd 60/00:d8:90:c7:ee/04:00:0c:00:00/40 tag 27 ncq 524288 in
Feb  4 19:10:03 atlantis kernel: [  484.264631]          res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:10:03 atlantis kernel: ata3.00: cmd 60/00:d8:90:c7:ee/04:00:0c:00:00/40 tag 27 ncq 524288 in
         res 50/00:02:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:10:03 atlantis kernel: ata3.00: status: { DRDY }
Feb  4 19:10:03 atlantis kernel: [  484.272371] ata3.00: status: { DRDY }
Feb  4 19:10:03 atlantis kernel: [  484.283077] ata3.00: revalidation failed (errno=-2)
Feb  4 19:10:03 atlantis kernel: [  484.286982] ata3: hard resetting link
Feb  4 19:10:03 atlantis kernel: ata3.00: revalidation failed (errno=-2)
Feb  4 19:10:03 atlantis kernel: ata3: hard resetting link
Feb  4 19:10:04 atlantis kernel: [  484.596697] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  4 19:10:04 atlantis kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  4 19:10:04 atlantis kernel: [  484.603078] ata3.00: configured for UDMA/133
Feb  4 19:10:04 atlantis kernel: [  484.606931] ata3: EH complete
Feb  4 19:10:04 atlantis kernel: ata3.00: configured for UDMA/133
Feb  4 19:10:04 atlantis kernel: ata3: EH complete
---snip con'd----
Feb  4 19:24:11 atlantis kernel: [ 1333.841337] ata3: failed to read log page 10h (errno=-5)
Feb  4 19:24:11 atlantis kernel: [ 1333.846065] ata3.00: exception Emask 0x1 SAct 0x6 SErr 0x0 action 0x0
Feb  4 19:24:11 atlantis kernel: [ 1333.850813] ata3.00: irq_stat 0x40000008
Feb  4 19:24:11 atlantis kernel: ata3: failed to read log page 10h (errno=-5)
Feb  4 19:24:11 atlantis kernel: ata3.00: exception Emask 0x1 SAct 0x6 SErr 0x0 action 0x0
Feb  4 19:24:11 atlantis kernel: ata3.00: irq_stat 0x40000008
Feb  4 19:24:11 atlantis kernel: [ 1333.855588] ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:24:11 atlantis kernel: [ 1333.860450] ata3.00: cmd 60/00:08:20:20:91/04:00:15:00:00/40 tag 1 ncq 524288 in
Feb  4 19:24:11 atlantis kernel: [ 1333.860450]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:24:11 atlantis kernel: [ 1333.869939] ata3.00: status: { DRDY }
Feb  4 19:24:11 atlantis kernel: [ 1333.874723] ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:24:11 atlantis kernel: [ 1333.879449] ata3.00: cmd 60/00:10:20:24:91/04:00:15:00:00/40 tag 2 ncq 524288 in
Feb  4 19:24:11 atlantis kernel: [ 1333.879449]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:24:11 atlantis kernel: [ 1333.888934] ata3.00: status: { DRDY }
Feb  4 19:24:11 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:24:11 atlantis kernel: ata3.00: cmd 60/00:08:20:20:91/04:00:15:00:00/40 tag 1 ncq 524288 in
         res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:24:11 atlantis kernel: ata3.00: status: { DRDY }
Feb  4 19:24:11 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:24:11 atlantis kernel: ata3.00: cmd 60/00:10:20:24:91/04:00:15:00:00/40 tag 2 ncq 524288 in
         res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
Feb  4 19:24:11 atlantis kernel: ata3.00: status: { DRDY }
Feb  4 19:24:11 atlantis kernel: [ 1333.900431] ata3.00: revalidation failed (errno=-2)
Feb  4 19:24:11 atlantis kernel: [ 1333.905185] ata3: hard resetting link
Feb  4 19:24:11 atlantis kernel: ata3.00: revalidation failed (errno=-2)
Feb  4 19:24:11 atlantis kernel: ata3: hard resetting link
Feb  4 19:24:12 atlantis kernel: [ 1334.217053] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  4 19:24:12 atlantis kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  4 19:24:12 atlantis kernel: [ 1334.225017] ata3.00: configured for UDMA/133
Feb  4 19:24:12 atlantis kernel: [ 1334.229846] ata3: EH complete
Feb  4 19:24:12 atlantis kernel: ata3.00: configured for UDMA/133
Feb  4 19:24:12 atlantis kernel: ata3: EH complete
Feb  4 19:25:29 atlantis kernel: [ 1411.927668] ata3.00: exception Emask 0x10 SAct 0xc00000 SErr 0x200000 action 0x6 frozen
Feb  4 19:25:29 atlantis kernel: [ 1411.931534] ata3.00: irq_stat 0x08000008, interface fatal error
Feb  4 19:25:29 atlantis kernel: [ 1411.935186] ata3: SError: { BadCRC }
Feb  4 19:25:29 atlantis kernel: ata3.00: exception Emask 0x10 SAct 0xc00000 SErr 0x200000 action 0x6 frozen
Feb  4 19:25:29 atlantis kernel: ata3.00: irq_stat 0x08000008, interface fatal error
Feb  4 19:25:29 atlantis kernel: ata3: SError: { BadCRC }
Feb  4 19:25:29 atlantis kernel: [ 1411.938834] ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:25:29 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:25:29 atlantis kernel: [ 1411.942474] ata3.00: cmd 60/00:b0:28:d5:5a/04:00:16:00:00/40 tag 22 ncq 524288 in
Feb  4 19:25:29 atlantis kernel: [ 1411.942474]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Feb  4 19:25:29 atlantis kernel: ata3.00: cmd 60/00:b0:28:d5:5a/04:00:16:00:00/40 tag 22 ncq 524288 in
         res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Feb  4 19:25:29 atlantis kernel: [ 1411.949448] ata3.00: status: { DRDY }
Feb  4 19:25:29 atlantis kernel: ata3.00: status: { DRDY }
Feb  4 19:25:29 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:25:29 atlantis kernel: [ 1411.952943] ata3.00: failed command: READ FPDMA QUEUED
Feb  4 19:25:29 atlantis kernel: [ 1411.956309] ata3.00: cmd 60/00:b8:28:d9:5a/04:00:16:00:00/40 tag 23 ncq 524288 in
Feb  4 19:25:29 atlantis kernel: [ 1411.956309]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Feb  4 19:25:29 atlantis kernel: ata3.00: cmd 60/00:b8:28:d9:5a/04:00:16:00:00/40 tag 23 ncq 524288 in
         res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Feb  4 19:25:29 atlantis kernel: [ 1411.962998] ata3.00: status: { DRDY }
Feb  4 19:25:29 atlantis kernel: ata3.00: status: { DRDY }
Feb  4 19:25:29 atlantis kernel: [ 1411.966227] ata3: hard resetting link
Feb  4 19:25:29 atlantis kernel: ata3: hard resetting link
Feb  4 19:25:35 atlantis kernel: [ 1417.335416] ata3: link is slow to respond, please be patient (ready=0)
Feb  4 19:25:35 atlantis kernel: ata3: link is slow to respond, please be patient (ready=0)
Feb  4 19:25:39 atlantis kernel: [ 1421.989709] ata3: COMRESET failed (errno=-16)
Feb  4 19:25:39 atlantis kernel: [ 1421.994600] ata3: hard resetting link
Feb  4 19:25:39 atlantis kernel: ata3: COMRESET failed (errno=-16)
Feb  4 19:25:39 atlantis kernel: ata3: hard resetting link
Feb  4 19:25:45 atlantis kernel: [ 1427.364236] ata3: link is slow to respond, please be patient (ready=0)
Feb  4 19:25:45 atlantis kernel: ata3: link is slow to respond, please be patient (ready=0)
Feb  4 19:25:49 atlantis kernel: [ 1432.019121] ata3: COMRESET failed (errno=-16)
Feb  4 19:25:49 atlantis kernel: [ 1432.023757] ata3: hard resetting link
Feb  4 19:25:49 atlantis kernel: ata3: COMRESET failed (errno=-16)
Feb  4 19:25:49 atlantis kernel: ata3: hard resetting link
Feb  4 19:25:55 atlantis kernel: [ 1437.392833] ata3: link is slow to respond, please be patient (ready=0)
Feb  4 19:25:55 atlantis kernel: ata3: link is slow to respond, please be patient (ready=0)
Feb  4 19:26:24 atlantis kernel: [ 1467.136950] ata3: COMRESET failed (errno=-16)
Feb  4 19:26:24 atlantis kernel: [ 1467.141422] ata3: limiting SATA link speed to 3.0 Gbps
Feb  4 19:26:24 atlantis kernel: [ 1467.145685] ata3: hard resetting link
Feb  4 19:26:24 atlantis kernel: ata3: COMRESET failed (errno=-16)
Feb  4 19:26:24 atlantis kernel: ata3: limiting SATA link speed to 3.0 Gbps
Feb  4 19:26:24 atlantis kernel: ata3: hard resetting link
Feb  4 19:26:29 atlantis kernel: [ 1472.156765] ata3: COMRESET failed (errno=-16)
Feb  4 19:26:29 atlantis kernel: [ 1472.160801] ata3: reset failed, giving up
Feb  4 19:26:29 atlantis kernel: [ 1472.164731] ata3.00: disabled
Feb  4 19:26:29 atlantis kernel: ata3: COMRESET failed (errno=-16)
Feb  4 19:26:29 atlantis kernel: ata3: reset failed, giving up
Feb  4 19:26:29 atlantis kernel: ata3.00: disabled
Feb  4 19:26:29 atlantis kernel: [ 1472.168844] ata3: EH complete
Feb  4 19:26:29 atlantis kernel: ata3: EH complete
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a d9 28 00 04 00 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375052584
Feb  4 19:26:29 atlantis kernel: [ 1472.172994] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.176560] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.180054] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.183421] Read(10): 28 00 16 5a d9 28 00 04 00 00
Feb  4 19:26:29 atlantis kernel: [ 1472.186732] end_request: I/O error, dev sdc, sector 375052584
Feb  4 19:26:29 atlantis kernel: [ 1472.190084] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.190550] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.190553] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.190556] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.190565] Read(10): 28 00 16 5a d9 28 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.190568] end_request: I/O error, dev sdc, sector 375052584
Feb  4 19:26:29 atlantis kernel: [ 1472.190594] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.190596] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.190599] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.190607] Read(10): 28 00 16 5a d9 a8 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.190610] end_request: I/O error, dev sdc, sector 375052712
Feb  4 19:26:29 atlantis kernel: [ 1472.190630] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.190631] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.190634] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.190642] Read(10): 28 00 16 5a da 28 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.190644] end_request: I/O error, dev sdc, sector 375052840
Feb  4 19:26:29 atlantis kernel: [ 1472.190664] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.190665] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.190668] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.190676] Read(10): 28 00 16 5a da a8 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.190678] end_request: I/O error, dev sdc, sector 375052968
Feb  4 19:26:29 atlantis kernel: [ 1472.190697] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.190698] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.190701] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.190709] Read(10): 28 00 16 5a db 28 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.190711] end_request: I/O error, dev sdc, sector 375053096
Feb  4 19:26:29 atlantis kernel: [ 1472.190730] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.190732] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.190734] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.190742] Read(10): 28 00 16 5a db a8 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.190744] end_request: I/O error, dev sdc, sector 375053224
Feb  4 19:26:29 atlantis kernel: [ 1472.191566] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.191568] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.191571] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.191581] Read(10): 28 00 16 5a dc 28 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.191584] end_request: I/O error, dev sdc, sector 375053352
Feb  4 19:26:29 atlantis kernel: [ 1472.191613] sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.191614] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: [ 1472.191617] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: [ 1472.191625] Read(10): 28 00 16 5a dc a8 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.191627] end_request: I/O error, dev sdc, sector 375053480
Feb  4 19:26:29 atlantis kernel: [ 1472.309900] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: [ 1472.312780] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a d9 28 00 00 80 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375052584
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a d9 a8 00 00 80 00
Feb  4 19:26:29 atlantis kernel: [ 1472.315653] Read(10): 28 00 16 5a d5 28 00 04 00 00
Feb  4 19:26:29 atlantis kernel: [ 1472.318569] end_request: I/O error, dev sdc, sector 375051560
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375052712
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a da 28 00 00 80 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375052840
Feb  4 19:26:30 atlantis kernel: [ 1472.321543] md/raid:md127: Too many read errors, failing device sdc1.
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a da a8 00 00 80 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375052968
Feb  4 19:26:30 atlantis kernel: [ 1472.324618] md/raid:md127: Disk failure on sdc1, disabling device.
Feb  4 19:26:30 atlantis kernel: [ 1472.324618] md/raid:md127: Operation continuing on 2 devices.
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a db 28 00 00 80 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375053096
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:30 atlantis kernel: [ 1472.330348] md/raid:md127: read error not correctable (sector 375049520 on sdc1).
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a db a8 00 00 80 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375053224
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:30 atlantis kernel: [ 1472.333180] md/raid:md127: read error not correctable (sector 375049528 on sdc1).
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a dc 28 00 00 80 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375053352
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:30 atlantis kernel: [ 1472.336083] md/raid:md127: read error not correctable (sector 375049536 on sdc1).
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a dc a8 00 00 80 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375053480
Feb  4 19:26:29 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:30 atlantis kernel: [ 1472.338834] md/raid:md127: read error not correctable (sector 375049544 on sdc1).
Feb  4 19:26:30 atlantis kernel: [ 1472.341483] md/raid:md127: read error not correctable (sector 375049552 on sdc1).
Feb  4 19:26:29 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:29 atlantis kernel: Read(10): 28 00 16 5a d5 28 00 04 00 00
Feb  4 19:26:29 atlantis kernel: end_request: I/O error, dev sdc, sector 375051560
Feb  4 19:26:30 atlantis kernel: md/raid:md127: Too many read errors, failing device sdc1.
Feb  4 19:26:30 atlantis kernel: md/raid:md127: Disk failure on sdc1, disabling device.
md/raid:md127: Operation continuing on 2 devices.
Feb  4 19:26:30 atlantis kernel: [ 1472.344136] md/raid:md127: read error not correctable (sector 375049560 on sdc1).
Feb  4 19:26:30 atlantis kernel: [ 1472.346947] md/raid:md127: read error not correctable (sector 375049568 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049520 on sdc1).
Feb  4 19:26:30 atlantis kernel: [ 1472.349635] md/raid:md127: read error not correctable (sector 375049576 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049528 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049536 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049544 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049552 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049560 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049568 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049576 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049584 on sdc1).
Feb  4 19:26:30 atlantis kernel: [ 1472.352139] md/raid:md127: read error not correctable (sector 375049584 on sdc1).
Feb  4 19:26:30 atlantis kernel: [ 1472.355113] md/raid:md127: read error not correctable (sector 375049592 on sdc1).
Feb  4 19:26:30 atlantis kernel: md/raid:md127: read error not correctable (sector 375049592 on sdc1).
Feb  4 19:26:30 atlantis kernel: [ 1472.357994] sd 2:0:0:0: [sdc]
Feb  4 19:26:30 atlantis kernel: [ 1472.360574] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:30 atlantis kernel: [ 1472.363229] sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:30 atlantis kernel: sd 2:0:0:0: [sdc]
Feb  4 19:26:30 atlantis kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb  4 19:26:30 atlantis kernel: sd 2:0:0:0: [sdc] CDB:
Feb  4 19:26:30 atlantis kernel: [ 1472.365664] Write(10): 2a 00 16 5a d9 28 00 04 00 00
Feb  4 19:26:30 atlantis kernel: Write(10): 2a 00 16 5a d9 28 00 04 00 00
Feb  4 19:27:30 atlantis kernel: [ 1532.836568] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb  4 19:27:30 atlantis kernel: [ 1532.839292] ata4.00: failed command: FLUSH CACHE EXT
Feb  4 19:27:30 atlantis kernel: [ 1532.841987] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 16
Feb  4 19:27:30 atlantis kernel: [ 1532.841987]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb  4 19:27:30 atlantis kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
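
As a cross-check on the log above, the Read(10) CDB bytes can be decoded by hand to confirm that they match the sector numbers in the `end_request` lines and the `ncq 524288 in` transfer sizes. A minimal sketch, assuming the standard 10-byte CDB layout and a 512-byte logical block size (the log does not state the block size; `decode_rw10` is a hypothetical helper, not part of any tool):

```python
def decode_rw10(cdb_hex, block_size=512):
    """Decode a SCSI READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a 10-byte READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {
        "op": "read" if cdb[0] == 0x28 else "write",
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,  # block_size is an assumption, see above
    }


# First failing read from the log above.
info = decode_rw10("28 00 16 5a d9 28 00 04 00 00")
print(info)
```

With these assumptions the first CDB decodes to LBA 375052584 and 1024 blocks (524288 bytes), which agrees with the reported I/O error sector and the NCQ transfer size.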
Comment 113 frollic 2015-02-04 19:05:06 UTC
In addition, the motherboard is brand new (which doesn't mean it can't be faulty), and the WDC drives are 2 months old (installed just before Christmas last year). The SSD was purchased used, so I can't say how old it is.

All of the hardware, except for the Samsung SSD, ran fine on my Supermicro X7SPA-H before I swapped the motherboard two days ago.
Comment 114 Ken-ichi Mito 2015-06-15 15:05:00 UTC
(In reply to Alex Williamson from comment #95)

I encountered the same problem with a PX-G128M6e (Plextor M6e series SSD) and resolved it with the patch.
(Specifically, I used the 4.0.5 kernel patched with the code described in https://lkml.org/lkml/2015/2/2/226 )

Booting from the SSD and passing the SSD through to a guest OS both work correctly.

My system is an Asus H97M-PLUS with BIOS 2501 and a PX-G128M6e with firmware revision 1.06.
The kernel .config is the one from Arch's linux 4.0.5-1 package.
Comment 115 Ken-ichi Mito 2015-06-15 15:11:11 UTC
Created attachment 179951 [details]
dmesg of 4.0.5 vanilla kernel with iommu=on

The output of `grep -i -e dmar -e iommu` is below:

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux-vanilla root=UUID=8445003e-6304-4d86-b970-2afa31781a9b rw intel_iommu=on
[    0.000000] ACPI: DMAR 0x00000000DAC6CED0 0000B8 (v01 INTEL  BDW      00000001 INTL 00000001)
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-linux-vanilla root=UUID=8445003e-6304-4d86-b970-2afa31781a9b rw intel_iommu=on
[    0.000000] Intel-IOMMU: enabled
[    0.107086] dmar: Host address width 39
[    0.107098] dmar: DRHD base: 0x000000fed90000 flags: 0x0
[    0.107123] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c0000020660462 ecap f0101a
[    0.107138] dmar: DRHD base: 0x000000fed91000 flags: 0x1
[    0.107154] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap d2008c20660462 ecap f010da
[    0.107169] dmar: RMRR base: 0x000000dbe7b000 end: 0x000000dbe89fff
[    0.107179] dmar: RMRR base: 0x000000dd000000 end: 0x000000df1fffff
[    0.107191] IOAPIC id 8 under DRHD base  0xfed91000 IOMMU 1
[    0.685402] DMAR: No ATSR found
[    0.685642] IOMMU: dmar0 using Queued invalidation
[    0.685651] IOMMU: dmar1 using Queued invalidation
[    0.685662] IOMMU: Setting RMRR:
[    0.685694] IOMMU: Setting identity map for device 0000:00:02.0 [0xdd000000 - 0xdf1fffff]
[    0.686154] IOMMU: Setting identity map for device 0000:00:14.0 [0xdbe7b000 - 0xdbe89fff]
[    0.686215] IOMMU: Setting identity map for device 0000:00:1a.0 [0xdbe7b000 - 0xdbe89fff]
[    0.686268] IOMMU: Setting identity map for device 0000:00:1d.0 [0xdbe7b000 - 0xdbe89fff]
[    0.686308] IOMMU: Prepare 0-16MiB unity mapping for LPC
[    0.686329] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[    0.847930] dmar: DRHD: handling fault status reg 2
[    0.848264] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[    1.161006] dmar: DRHD: handling fault status reg 3
[    1.161963] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[    6.159656] dmar: DRHD: handling fault status reg 2
[    6.160750] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[    6.472980] dmar: DRHD: handling fault status reg 3
[    6.473513] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[   11.471329] dmar: DRHD: handling fault status reg 2
[   11.471661] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[   11.784476] dmar: DRHD: handling fault status reg 3
[   11.785472] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[   16.783038] dmar: DRHD: handling fault status reg 2
[   16.783646] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000 
               DMAR:[fault reason 02] Present bit in context entry is clear
[   74.080032] [drm] DMAR active, disabling use of stolen memory
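
Every fault in the output above names requester `02:00.1`, which is consistent with the incorrect requester ID problem this bug is about. For longer logs, a short script can tally which requester IDs the IOMMU is rejecting. This is only a sketch: the regular expression assumes the fault line format shown above (other kernel versions may format these lines differently), and `count_fault_requesters` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the fault line format seen in the dmesg excerpt above.
FAULT_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device "
    r"\[([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] fault addr ([0-9a-f]+)"
)


def count_fault_requesters(dmesg_text):
    """Count DMAR faults per (requester ID, direction) in dmesg text."""
    counts = Counter()
    for direction, requester, _addr in FAULT_RE.findall(dmesg_text):
        counts[(requester, direction)] += 1
    return counts


# Two fault records copied from the excerpt above.
sample = """\
[    0.847930] dmar: DRHD: handling fault status reg 2
[    0.848264] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000
               DMAR:[fault reason 02] Present bit in context entry is clear
[    1.161006] dmar: DRHD: handling fault status reg 3
[    1.161963] dmar: DMAR:[DMA Write] Request device [02:00.1] fault addr fffe0000
               DMAR:[fault reason 02] Present bit in context entry is clear
"""
counts = count_fault_requesters(sample)
for (requester, direction), n in counts.items():
    print(f"{requester} DMA {direction}: {n} fault(s)")
```

Comparing the requester IDs reported here against the functions that `lspci` actually lists for that bus and slot shows whether the device is issuing DMA under an ID the IOMMU has no context entry for.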
Comment 116 Ken-ichi Mito 2015-06-15 15:13:28 UTC
Created attachment 179961 [details]
dmesg of 4.0.5 patched kernel with iommu=on

The output of `grep -i -e dmar -e iommu` is below:

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux-m6e root=UUID=8445003e-6304-4d86-b970-2afa31781a9b rw intel_iommu=on
[    0.000000] ACPI: DMAR 0x00000000DAC6CED0 0000B8 (v01 INTEL  BDW      00000001 INTL 00000001)
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-linux-m6e root=UUID=8445003e-6304-4d86-b970-2afa31781a9b rw intel_iommu=on
[    0.000000] Intel-IOMMU: enabled
[    0.107025] dmar: Host address width 39
[    0.107037] dmar: DRHD base: 0x000000fed90000 flags: 0x0
[    0.107060] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c0000020660462 ecap f0101a
[    0.107075] dmar: DRHD base: 0x000000fed91000 flags: 0x1
[    0.107092] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap d2008c20660462 ecap f010da
[    0.107107] dmar: RMRR base: 0x000000dbe7b000 end: 0x000000dbe89fff
[    0.107117] dmar: RMRR base: 0x000000dd000000 end: 0x000000df1fffff
[    0.107129] IOAPIC id 8 under DRHD base  0xfed91000 IOMMU 1
[    0.688999] DMAR: No ATSR found
[    0.689240] IOMMU: dmar0 using Queued invalidation
[    0.689249] IOMMU: dmar1 using Queued invalidation
[    0.689259] IOMMU: Setting RMRR:
[    0.689292] IOMMU: Setting identity map for device 0000:00:02.0 [0xdd000000 - 0xdf1fffff]
[    0.689754] IOMMU: Setting identity map for device 0000:00:14.0 [0xdbe7b000 - 0xdbe89fff]
[    0.689816] IOMMU: Setting identity map for device 0000:00:1a.0 [0xdbe7b000 - 0xdbe89fff]
[    0.689868] IOMMU: Setting identity map for device 0000:00:1d.0 [0xdbe7b000 - 0xdbe89fff]
[    0.689908] IOMMU: Prepare 0-16MiB unity mapping for LPC
[    0.689930] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[   66.222474] [drm] DMAR active, disabling use of stolen memory
Comment 117 Ken-ichi Mito 2015-06-15 15:18:19 UTC
`lspci -nnvv`

02:00.0 SATA controller [0106]: Lite-On IT Corp. / Plextor M6e PCI Express SSD [Marvell 88SS9183] [1c28:0122] (rev 14) (prog-if 01 [AHCI 1.0])
	Subsystem: Marvell Technology Group Ltd. Device [1b4b:9183]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 30
	Region 0: I/O ports at e050 [size=8]
	Region 1: I/O ports at e040 [size=4]
	Region 2: I/O ports at e030 [size=8]
	Region 3: I/O ports at e020 [size=4]
	Region 4: I/O ports at e000 [size=32]
	Region 5: Memory at f7c20000 (32-bit, non-prefetchable) [size=512]
	Expansion ROM at f7c00000 [disabled] [size=128K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00378  Data: 0000
	Capabilities: [70] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP+ Rollover- Timeout+ NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Kernel driver in use: ahci
	Kernel modules: ahci

----

`lspci -nnvv` on the host while the SSD is passed through to a guest OS

02:00.0 SATA controller [0106]: Lite-On IT Corp. / Plextor M6e PCI Express SSD [Marvell 88SS9183] [1c28:0122] (rev 14) (prog-if 01 [AHCI 1.0])
	Subsystem: Marvell Technology Group Ltd. Device [1b4b:9183]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: I/O ports at e050 [size=8]
	Region 1: I/O ports at e040 [size=4]
	Region 2: I/O ports at e030 [size=8]
	Region 3: I/O ports at e020 [size=4]
	Region 4: I/O ports at e000 [size=32]
	Region 5: Memory at f7c20000 (32-bit, non-prefetchable) [size=512]
	Expansion ROM at f7c00000 [disabled] [size=128K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000
	Capabilities: [70] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [e0] SATA HBA v0.0 BAR4 Offset=00000004
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP+ Rollover- Timeout+ NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Kernel driver in use: vfio-pci
	Kernel modules: ahci
Comment 118 Tasos Sahanidis 2015-06-23 02:35:58 UTC
I believe I am affected by the same bug with the Marvell 88SE9120 controller on an ASRock 990FX Extreme 4 motherboard.
Although there are no DMAR errors in dmesg, when AMD's IOMMU is enabled in the BIOS I get the following a couple of times before it gives up:

[  117.616423] ata9: hard resetting link
[  117.632972] AMD-Vi: Event logged [IO_PAGE_FAULT device=02:00.1 domain=0x0000 address=0x0000000000020440 flags=0x0070]
[  117.632982] AMD-Vi: Event logged [IO_PAGE_FAULT device=02:00.1 domain=0x0000 address=0x0000000000020450 flags=0x0070]
[  118.340472] AMD-Vi: Event logged [IO_PAGE_FAULT device=02:00.1 domain=0x0000 address=0x0000000000020000 flags=0x0050]
[  122.616621] ata9: softreset failed (1st FIS failed)
[  122.616632] ata9: reset failed, giving up
[  122.616640] ata9: EH complete

Once the controller's device ID was added to drivers/pci/quirks.c, everything worked as expected in kernel 4.1 from git.kernel.org (23b7776290b10297fe2cae0fb5f166a4f2c68121):

[ 1520.100391] ata9: hard resetting link
[ 1526.038156] ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 330)
[ 1526.044554] ata9.00: ATA-7: SAMSUNG HD502IJ, 1AA01112, max UDMA7
[ 1526.044559] ata9.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[ 1526.050996] ata9.00: configured for UDMA/133
[ 1526.051007] ata9: EH complete

And here is the patch

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3589,6 +3589,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x91a0,
 /* https://bugzilla.kernel.org/show_bug.cgi?id=42679#c49 */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9230,
 			 quirk_dma_func1_alias);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9120,
+			 quirk_dma_func1_alias);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_TTI, 0x0642,
 			 quirk_dma_func1_alias);
 /* https://bugs.gentoo.org/show_bug.cgi?id=497630 */

Could this device id be added to the list of affected devices?
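
For readers unfamiliar with the quirk, here is a minimal userspace sketch of the idea behind quirk_dma_func1_alias: the controller issues DMA tagged with function 1 of its own slot, so the IOMMU has to be told to accept that requester ID as an alias. Everything prefixed with sketch_/SKETCH_ is hypothetical and written for illustration only (the macros reuse the bit layout of the kernel's PCI_DEVFN/PCI_SLOT/PCI_FUNC), and the assumption that the controller itself sits at 02:00.0 is inferred from the device=02:00.1 faults above, not stated in the logs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit layout as the kernel's PCI_DEVFN/PCI_SLOT/PCI_FUNC macros;
 * renamed here so the sketch does not depend on kernel headers. */
#define SKETCH_PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define SKETCH_PCI_SLOT(devfn)       (((devfn) >> 3) & 0x1f)
#define SKETCH_PCI_FUNC(devfn)       ((devfn) & 0x07)

/* Hypothetical stand-in for the few struct pci_dev fields that matter. */
struct sketch_pci_dev {
	uint8_t bus;
	uint8_t devfn;           /* slot and function packed into one byte */
	int     has_dma_alias;   /* set once the quirk has run */
	uint8_t dma_alias_devfn; /* extra devfn the device may use for DMA */
};

/* Requester ID as seen by the IOMMU: bus in the high byte, devfn in the low. */
static uint16_t sketch_requester_id(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)((bus << 8) | devfn);
}

/* Illustrates the quirk: unless the device already is function 1,
 * record function 1 of the same slot as a DMA alias. */
static void sketch_dma_func1_alias(struct sketch_pci_dev *dev)
{
	assert(dev != NULL);
	if (SKETCH_PCI_FUNC(dev->devfn) != 1) {
		dev->dma_alias_devfn =
			SKETCH_PCI_DEVFN(SKETCH_PCI_SLOT(dev->devfn), 1);
		dev->has_dma_alias = 1;
	}
}

/* Would a DMA carrying requester ID `rid` be attributed to this device? */
static int sketch_rid_allowed(const struct sketch_pci_dev *dev, uint16_t rid)
{
	if (rid == sketch_requester_id(dev->bus, dev->devfn))
		return 1;
	return dev->has_dma_alias &&
	       rid == sketch_requester_id(dev->bus, dev->dma_alias_devfn);
}
```

With a device at bus 0x02, slot 0, function 0, the requester ID 02:00.1 is rejected before sketch_dma_func1_alias() runs and accepted afterwards, which mirrors the IO_PAGE_FAULT entries disappearing once the device ID is listed.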
Comment 119 Alex Williamson 2015-06-23 03:10:07 UTC
(In reply to Tasos Sahanidis from comment #118)
> 
> Could this device id be added to the list of affected devices?

It's already queued in the pull request for v4.2:

http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/drivers/pci/quirks.c?id=247de694349c2eeea11b8d8936541f5012a09318
Comment 120 Tasos Sahanidis 2015-06-23 03:25:22 UTC
(In reply to Alex Williamson from comment #119)
> It's already queued in the pull request for v4.2:
> 
> http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/drivers/pci/quirks.c?id=247de694349c2eeea11b8d8936541f5012a09318

Apologies for that, did not see it.
Thank you for your time!
Comment 121 bgh 2015-08-03 21:21:08 UTC
Hi.  Old Newbie to kernel things here.  I see from Alex's (initial?) patch at https://github.com/awilliam/linux-vfio/blob/02f8c6aee8df3cdc935e9bdd4f2d020306035dbe/drivers/ata/ahci.c that my 88SE9128 is in the quirks list.  

However, exploring at https://github.com/awilliam/linux-vfio/blob/02f8c6aee8df3cdc935e9bdd4f2d020306035dbe/drivers/ata/ahci.c  I don't see it. 

So - I'm probably looking in all the wrong places.

I've just set up Fedora 22 4.1.3-200.fc22.x86_64. I'm getting this fatal error.

ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.00: failed command: WRITE DMA
ata10.00: cmd ca/00:01:08:08:00/00:00:00:00:00/e0 tag 5 dma 512 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata10.00: status: { DRDY }
ata10: hard resetting link
ata10: link is slow to respond, please be patient (ready=0)
ata10: COMRESET failed (errno=-16)
ata10: hard resetting link
ata10: link is slow to respond, please be patient (ready=0)
ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.00: qc timeout (cmd 0xec)
ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata10.00: revalidation failed (errno=-5)
ata10: hard resetting link
ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata11.00: failed command: READ DMA EXT
ata11.00: cmd 25/00:10:20:d5:c5/00:00:12:00:00/e0 tag 24 dma 8192 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata11.00: status: { DRDY }
ata11: hard resetting link
ata10: link is slow to respond, please be patient (ready=0)
ata11: link is slow to respond, please be patient (ready=0)
ata10: COMRESET failed (errno=-16)
ata10: hard resetting link
ata11: COMRESET failed (errno=-16)
ata11: hard resetting link
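
As a reading aid for the "cmd" lines above, the following sketch decodes the taskfile string that libata prints (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device). The tf_* names are hypothetical helpers, not kernel functions; only the two 48-bit opcodes 0x25 (READ DMA EXT) and 0x35 (WRITE DMA EXT) are treated as LBA48, and 512-byte logical sectors are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Taskfile registers in the order libata prints them after "cmd ". */
struct tf_sketch {
	unsigned command, feature, nsect, lbal, lbam, lbah;
	unsigned hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah;
	unsigned device;
};

/* Parse e.g. "ca/00:01:08:08:00/00:00:00:00:00/e0".
 * Returns 0 on success, -1 if the string does not match. */
static int tf_parse(const char *s, struct tf_sketch *tf)
{
	assert(s != NULL && tf != NULL);
	int n = sscanf(s, "%2x/%2x:%2x:%2x:%2x:%2x/%2x:%2x:%2x:%2x:%2x/%2x",
		       &tf->command, &tf->feature, &tf->nsect,
		       &tf->lbal, &tf->lbam, &tf->lbah,
		       &tf->hob_feature, &tf->hob_nsect,
		       &tf->hob_lbal, &tf->hob_lbam, &tf->hob_lbah,
		       &tf->device);
	return n == 12 ? 0 : -1;
}

/* Simplification: only READ DMA EXT (0x25) and WRITE DMA EXT (0x35)
 * are recognised as 48-bit commands. */
static int tf_is_lba48(const struct tf_sketch *tf)
{
	return tf->command == 0x25 || tf->command == 0x35;
}

/* Starting LBA: 48-bit commands use the "hob" bytes as the upper half,
 * 28-bit commands keep LBA bits 27:24 in the low nibble of `device`. */
static uint64_t tf_lba(const struct tf_sketch *tf)
{
	uint64_t low = ((uint64_t)tf->lbah << 16) |
		       ((uint64_t)tf->lbam << 8) | tf->lbal;

	if (tf_is_lba48(tf))
		return ((uint64_t)tf->hob_lbah << 40) |
		       ((uint64_t)tf->hob_lbam << 32) |
		       ((uint64_t)tf->hob_lbal << 24) | low;
	return ((uint64_t)(tf->device & 0x0f) << 24) | low;
}

/* Sector count; a count of 0 means the maximum for the command class. */
static uint32_t tf_sectors(const struct tf_sketch *tf)
{
	if (tf_is_lba48(tf)) {
		uint32_t n = (tf->hob_nsect << 8) | tf->nsect;
		return n ? n : 65536;
	}
	return tf->nsect ? tf->nsect : 256;
}
```

Applied to the two failing commands, this gives a 1-sector WRITE DMA (0xca) at LBA 0x808 and a 16-sector READ DMA EXT (0x25) at LBA 0x12c5d520, matching the "dma 512 out" and "dma 8192 in" sizes in the log. Both are small, ordinary transfers, so the timeouts point at the controller path rather than at unusual requests.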

This is a StarTech PEXSAT31E1 add-on card, so the system does not boot from it. It's connected to an external cabinet, and I'm using mdadm for RAID-5. All drives report the same issues (logging not included here), which is what had me looking at the controller.

I am really hoping it's not included yet, which would explain both the issue and the fact that 'the fix is in'.

I've not built a kernel since - well, a long time ago - Ubuntu 6.10 or so. Now I might get a chance to try it on Fedora.

Please let me know if it would help if I provided more info.  Sure looks like I'm just like most others here...

Can anyone Help?

Many Thanks :-)
/Bill
Comment 122 bgh 2015-08-05 14:14:34 UTC
*bump*

I'm down here. I'm contemplating getting a 3ware and going the hardware route. I've had a pretty horrid experience with Highpoint support (non-existent), and the Marvell controllers seem to be dysfunctional. The vendor who sold me the card could not provide any drivers or firmware updates, so the kernel patch(es) are my only possible path to a solution using this type of controller.

Thanks.
Comment 123 frollic 2015-08-05 14:21:35 UTC
For the 9230 you might want to check the updated BIOS we've discussed at: 
http://homeservershow.com/forums/index.php?/topic/9179-marvell-9230-firmware-updates-and-such/
Comment 124 oh-itsme 2015-08-12 21:09:50 UTC
(In reply to frollic from comment #123)
> For the 9230 you might want to check the updated BIOS we've discussed at: 
> http://homeservershow.com/forums/index.php?/topic/9179-marvell-9230-firmware-updates-and-such/

I had found that thread in a websearch as I have encountered similar issues as you had, also using a Supermicro X10SBA. I had contacted Supermicro about this, but support did not really seem to be aware of this issue, and no update for the controller was sent to me. The thread you refer to does not state the outcome of applying the firmware to the X10SBA, does it solve the issue?
Comment 125 frollic 2015-08-13 07:11:30 UTC
(In reply to oh-itsme from comment #124)
> I had found that thread in a websearch as I have encountered similar issues
> as you had, also using a Supermicro X10SBA. I had contacted Supermicro about
> this, but support did not really seem to be aware of this issue, and no
> update for the controller was sent to me. 

I was in touch with Supermicro's Dutch support. They were very helpful; it took them about 10 days to obtain the update from Marvell.
The person I was in contact with wrote that the update would be posted along with the next BIOS update for the motherboard, but I don't think it actually happened :(

> The thread you refer to does not state the outcome of applying the firmware 
> to the X10SBA, does it solve the issue?

Yes, it helped me. The soft-RAID is running fine now, even though I get an occasional "mismatch_cnt is not 0 on /dev/mdXXX" when running raid-check.
Comment 126 Tasos Sahanidis 2015-11-23 12:34:32 UTC
There seems to have been a regression sometime after the 4.3 tag (6a13feb9c82803e2b815eca72fa7a9f5561d7861) and before one of the commits on 2015-11-07 (when my kernel was compiled), which causes the same errors in dmesg as in comment #118.
This results in the drives attached to the controller becoming inaccessible.

Please note that this time the quirk for my device is present in drivers/pci/quirks.c but it seems to have no effect.
Comment 127 Kevin Hunt 2017-12-29 04:58:14 UTC
Hi There

I just want to report a problem with the ASRock Extreme 9 X79 platform (BIOS P4.00) and its Marvell 88SE9220 controller.

I experience the same DMAR faults as described above when this controller is enabled.
However, the problem appears to be resolved by adding a new entry in quirks.c:

DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9220,
                        quirk_dma_func1_alias);

Let me know if you need me to attach any logs of the faults. At the moment I'm using a custom-compiled kernel with the above fix on Arch Linux, but I can switch to a standard kernel.

Kind Regards,
Comment 128 Alan 2018-01-12 23:32:53 UTC
If you've got the quirk fix and have done the testing, then I would read Documentation/process/submitting-patches.rst and submit your quirk fix as a patch with an explanation of what it fixes. The change looks correct to me.

Send it to linux-ide@vger.kernel.org and it should get reviewed and merged.

Alan
Comment 129 microsoftenator 2018-04-19 02:31:40 UTC
I can confirm that this issue occurs with the Marvell 88SE9128 controller on my Gigabyte GA-X59A-UD7 (rev2.0) motherboard. As with Kevin Hunt above, adding a new entry in quirks.c appears to resolve the issue.

Given the name of this bug, I was surprised that the 9128 wasn't in there.
Comment 130 microsoftenator 2018-04-19 02:52:21 UTC
Addendum to the above:

The 9128 *does* appear to be in the quirks file for mainline, but not in the kernel provided by Arch Linux (4.15.15). It seems it was either added in 4.16 or Arch's patches removed it for some reason.
Comment 131 Bjorn Helgaas 2018-04-19 17:25:33 UTC
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=aa0082066343 for Marvell 9128 appeared in v4.16-rc1.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=832e4e1f76b8 for Marvell 88SE9220 appeared in v4.17-rc1.

Are there any devices that are still broken in v4.17-rc1?  If not, maybe we can close this bug?
Comment 132 Signor Rossi 2018-05-14 18:09:13 UTC
(In reply to Bjorn Helgaas from comment #131)
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=aa0082066343 for Marvell 9128 appeared in v4.16-rc1.
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=832e4e1f76b8 for Marvell 88SE9220 appeared in v4.17-rc1.
> 
> Are there any devices that are still broken in v4.17-rc1?  If not, maybe we
> can close this bug?


I still have this issue with a Marvell 88SE9230 and kernel v4.16.8 under Arch Linux. It's probably worth checking all their SATA Controllers before closing this bug: https://www.marvell.com/storage/system-solutions/
Comment 133 Bjorn Helgaas 2018-05-14 20:12:56 UTC
v4.16 already contains a quirk for the Marvell 88SE9230 (added by cc346a4714a5 ("PCI: Add function 1 DMA alias quirk for Marvell devices") way back in v3.16).

But from comment #44 and comments #49-#58, it sounds like the 9230 has other problems in addition to this one, so I suspect you're seeing those other problems.  If so, can you open a new bug for that and copy Joshua and Alex?  I took a quick look and didn't see a definitive resolution for the problems Joshua reported.

I'm going to close this one and if people see more problems that are resolved by quirk_dma_func1_alias(), they can add them here and reopen the bug.
Comment 134 f.bluethner 2018-08-12 11:41:19 UTC
I have this issue with the "Marvell Technology Group Ltd. 88SS9183 PCIe SSD Controller" in my "Asus Rog Strix Z370-F Gaming" and solved it by adding "DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9183,
quirk_dma_func1_alias);" next to the other quirk_dma_func1_alias entries in drivers/pci/quirks.c.
Comment 135 John Smith 2020-11-07 16:56:39 UTC
"Marvell Technology Group Ltd. 88SS9215 PCIe SSD Controller" has the same bug.
Fixed by:

DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9215,
			 quirk_dma_func1_alias);
Comment 136 sbingner 2021-05-14 02:34:58 UTC
Also "Marvell Technology Group Ltd. 88SE9125 PCIe SATA 6.0 Gb/s controller [1b4b:9125]" - fixed with:

DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9125,
			 quirk_dma_func1_alias);

Is this sufficient or should I open a new bug?
Comment 137 Alan 2021-05-14 13:09:31 UTC
Even better would be to make a git diff of it and then submit it with an explanation to

axboe@kernel.dk and cc linux-ide@vger.kernel.org

See:
https://www.kernel.org/doc/html/latest/process/submitting-patches.html
Comment 138 Tom Li 2021-11-16 14:38:03 UTC
(In reply to sbingner from comment #136)
> Also "Marvell Technology Group Ltd. 88SE9125 PCIe SATA 6.0 Gb/s controller
> [1b4b:9125]" - fixed with:
> 
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MARVELL_EXT, 0x9125,
>                        quirk_dma_func1_alias);
> 
> Is this sufficient or should I open a new bug?

I have the same hardware and was able to test and confirm the bug. I just submitted the patch to the Linux kernel maintainers. Hopefully it'll be accepted soon.

https://patchwork.kernel.org/project/linux-pci/patch/YZPA+gSsGWI6+xBP@work/
Comment 139 Tom Li 2022-01-23 18:32:36 UTC
(In reply to Tom Li from comment #138)
> (In reply to sbingner from comment #136)
> > Also "Marvell Technology Group Ltd. 88SE9125 PCIe SATA 6.0 Gb/s controller
> > [...]
> > Is this sufficient or should I open a new bug?
> 
> I have the same hardware and was able to test and confirm the bug. I just
> submitted the patch to the Linux kernel maintainers. Hopefully it'll be
> accepted soon.
> 
> https://patchwork.kernel.org/project/linux-pci/patch/YZPA+gSsGWI6+xBP@work/

Patch for 88SE9125 has been merged into the upstream kernel since Linux v5.17-rc1. 

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e445375882883f69018aa669b67cbb37ec873406

Greg K.H. has also queued this patch for Linux 4.4, 4.9, 4.14, 5.4, 5.10, 5.15, 5.16. The patch should appear in the next stable kernel release in each branch.
Comment 140 Tom Li 2022-01-29 02:35:43 UTC
(In reply to Tom Li from comment #139)
> Patch for 88SE9125 has been merged into the upstream kernel since Linux
> v5.17-rc1. 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e445375882883f69018aa669b67cbb37ec873406
> 
> Greg K.H. has also queued this patch for Linux 4.4, 4.9, 4.14, 5.4, 5.10,
> 5.15, 5.16. The patch should appear in the next stable kernel release in
> each branch.

My patch has just been included in Linux 4.4.300, 4.9.298, 4.14.263, 4.19.226, 5.4.174, 5.10.94, 5.15.17, and 5.16.3.