Bug 7768

Summary: data corruption with Opteron CPUs and Nvidia chipsets / memory hole mapping / IOMMU-related bug?!
Product: Platform Specific/Hardware
Reporter: Christoph Anton Mitterer (calestyo)
Component: x86-64
Assignee: Andi Kleen (andi-bz)
Severity: high
CC: arekm, bdimm, bill-osdl.org-bugzilla, cw, halbert, iustin, joachim.deguara, kernelbug, Markus.Rechberger, muli, netllama, Nicolas.Mailhot, peter.wainwright, protasnb
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: at least until
Tree: Mainline
Regression: ---
Attachments: Use sw-iommu per default on Nvidia boards

Description Christoph Anton Mitterer 2007-01-04 04:56:36 UTC

This bug is mainly discussed on LKML, and for in-depth information you should
probably read the whole thread(s) there.
You can find the archive here:

Now a short description:
- I (and many others) found a data corruption issue that happens on AMD
Opteron / Nvidia chipset systems.

- What happens: if one reads/writes large amounts of data, there are errors.
We test this the following way: create a huge amount of test data, compute
md5sums of it (or hashes with other algorithms), then verify
them over and over.
The test shows differences (refer to the LKML thread for more information
about this), always in different files (!!!!). It may happen on read AND
write access.
Note that even for affected users the error occurs rarely (but this is
of course still far too often). My personal tests show about the following:
test data: 30 GB (of random data), which I verify with sha512sum 50 times (that is
what I call one complete test). So I verify 30*50 GB. In one complete
test there are about 1-3 files with differences, with about 100
corrupted bytes each (at least very low data sizes, far below a MB).

- It probably happens with all the nforce chipsets (see the LKML thread
where everybody lists his hardware).

- The reasons are not isolated hardware defects (dozens of high-quality
memory, CPU, PCI bus, HDD bad-block, PCI parity, ECC, etc. tests showed this,
and even with different hardware components the issue remained).

- It is probably not an operating-system-related bug, although Windows
won't suffer from it. The reason is that Windows is (too
stupid) ... I mean unable to use the hardware IOMMU at all.

- It happens with both PATA and SATA disks. To be exact: it may be that
this has nothing specifically to do with hard disks at all.
It is probably PCI-DMA related (see LKML for more info and the reasons for
this thesis).

- Only users with a lot of main memory (I don't know the exact value by heart
and I'm too lazy to look it up)... say 4 GB, will suffer from this problem.
Why? Only users who need the memory hole mapping and the IOMMU will
suffer from the problem (this is why we think it is chipset related).

- We found two "workarounds", but both have big problems:
Workaround 1: disable Memory Hole Mapping in the system BIOS entirely.
The issue no longer occurs, BUT you lose a big part of your main memory
(depending on the size of the memhole, which itself depends on the PCI
devices). In my case I lose 1.5 GB of my 4 GB. Most users will probably
lose 1 GB.
=> unacceptable

Workaround 2: as mentioned, Windows won't suffer from the problem because it
always uses a software IOMMU. (Btw: the same applies to Intel CPUs
with EM64T/Intel 64; these CPUs don't even have a hardware IOMMU.)
Linux is able to use the hardware IOMMU (which of course accelerates the
whole system).
If you tell the kernel (Linux) to use a software IOMMU (with the kernel
parameter iommu=soft), the issue won't appear.
=> this is better than workaround 1 but still not really acceptable.
Why? There are some remaining problems:

The hardware IOMMU, and systems with such big main memory, are largely used
in computing centres. Those groups won't give up the hw IOMMU in
general simply because some Opteron (and perhaps Athlon) / Nvidia
combinations cause problems.
(I can tell you this because I work at the Leibniz Supercomputing Centre,
one of the largest in Europe.)

But as we don't know the exact reason for the issue, we cannot
selectively switch to iommu=soft for affected
mainboards/chipsets/CPU steppings/and the like.

We'd have to use a kernel-wide iommu=soft as a catch-all solution.
But it is highly unlikely that this would be accepted by the Linux community
(not to talk about end users like the supercomputing centres), and I
don't want to talk about other OSes.

Perhaps this might be solvable via BIOS fixes, but of course not by the
stupid solution "disable the hw IOMMU via the BIOS".
Perhaps the reason is a Linux kernel bug (although this is highly unlikely).
Last but not least, perhaps this is an AMD Opteron/Athlon issue (note: these
CPUs have the memory controllers directly integrated) and/or an
Nvidia nforce chipset issue.
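For reference, the iommu=soft workaround is passed on the kernel command line at boot. A hypothetical GRUB (legacy) stanza might look like the following; the title, paths, and root device are illustrative placeholders, not taken from this report:

```text
title Linux (software IOMMU workaround)
    root (hd0,0)
    kernel /boot/vmlinuz root=/dev/sda1 ro iommu=soft
    initrd /boot/initrd.img
```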


PS: Please post any other resources/links to threads about this or
similar problems.
Comment 1 Chris Wedgwood 2007-01-16 18:22:29 UTC
A recent update to this:

From: Andi Kleen <ak@suse.de>
Date: Wed, 17 Jan 2007 08:29:53 +1100
Subject: Re: data corruption with nvidia chipsets and IDE/SATA drives (k8 cpu
        errata needed?)
Message-Id: <200701170829.54540.ak@suse.de>

AMD is looking at the issue. Only Nvidia chipsets seem to be affected,
although there were similar problems on VIA in the past too.
Unless a good workaround comes around soon I'll probably default
to iommu=soft on Nvidia.
Comment 2 Marc Perkel 2007-01-17 07:24:05 UTC
I too am seeing this problem on AM2 Athlon X2 ASUS boards. However, it is not a
problem with 939-pin Athlon X2 boards. Both run nVidia chips. I tested this
on several boards and it's consistent, so I'm convinced it's not my hardware
that's bad.

As discussed here, it ONLY happens with 4 GB of RAM. It also seems to be triggered by
disk I/O - generally running large rsync jobs. And iommu=soft fixes it.
Comment 3 Joachim Deguara 2007-01-18 01:10:42 UTC
Marc and Chris, do your boards both have the Nvidia CK804 SATA controller?
Comment 4 Arkadiusz Miskiewicz 2007-01-18 02:00:51 UTC
Note that this problem is visible not only with SATA. In my case I was using a qlogic
2312 fibre channel card and a (not used but installed) adaptec controller in the taro
slot. No SATA hard disks. 8GB of RAM.


Example corruption diffs: http://ep09.pld-linux.org/~arekm/qlogic/

# lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address 
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address 
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control
01:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
08:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
08:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
08:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
08:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
09:08.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 02)
0a:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit 
Ethernet (rev 03)
0a:09.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit 
Ethernet (rev 03)

Adaptec controller was removed and that somehow cured the problem.

IOMMU is used:
PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ d8000000 size 65536 KB
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
Comment 5 Joachim Deguara 2007-01-18 02:20:35 UTC
Arkadiusz, from your diffs it looks like you are getting single bit flips, and
this does not appear to be related to this bug.  Does it go away when you set
"iommu=soft"?  It may just be a hardware issue, as you mentioned that removing a
different card in your system made the problem go away (perhaps the PSU was
Comment 6 Christoph Anton Mitterer 2007-01-18 06:23:12 UTC
>Marc and Chris, do your boards both have the Nvidia CK804 SATA controller?
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)

Yes, I have one.
Comment 7 Arkadiusz Miskiewicz 2007-01-18 07:44:21 UTC
Joachim: I cannot test iommu=soft, unfortunately. The machines have been in production
for a few months now.

The problem was visible with this combination:
problematic: qlogic card + adaptec in taro slot (both PCI-X, but the taro slot has
a proprietary connector)
working fine: lsi megaraid scsi PCI-E + adaptec in taro slot

Maybe it's a different bug... maybe not.
Comment 8 Lonni J Friedman 2007-01-18 07:49:06 UTC
Arkadiusz, could you post the specific models of Qlogic, Adaptec & LSI card that
you're using?  thanks.
Comment 9 Arkadiusz Miskiewicz 2007-01-18 08:24:43 UTC
intel srcu42e raid controller, PCI-E (lsi megaraid), driver reports: megaraid: fw 
version:[514P] bios version:[H431]

scsi0 : qla2xxx
qla2300 0000:09:08.0:
 QLogic Fibre Channel HBA Driver: 8.01.04-k-fw
  QLogic QLA2340 -
  ISP2312: PCI-X (133 MHz) @ 0000:09:08.0 hdma+, host#=0, fw=3.03.18 IPX

adaptec Adaptec AIC-7901 U320 (rev 10) in taro slot:

scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec AIC7901 Ultra320 SCSI adapter>
        aic7901: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs

the platform is tyan gt24b2891:
with TYAN Thunder K8SRE mainboard,
two dual core 270 opterons and 6GB ram
Comment 10 Peter Wainwright 2007-01-26 02:00:07 UTC
This may be a separate bug, but in case it is related, here goes:
I also suffer from data corruption when writing large files.
I have 8GB (of which >5GB is presently in use by a large
long-running computational electromagnetics program,
so I'm definitely using high memory).  This is with an HP xw9300
with dual Opteron 285 and an Nvidia chipset including CK804 SATA.
However, I also get these messages:

Jan 26 09:58:31 tux kernel: PCI-DMA: Out of IOMMU space for 524288 bytes at
device 0000:00:07.0
Jan 26 09:58:31 tux kernel: ata1: status=0x50 { DriveReady SeekComplete }

whereas, if I understand correctly, your corruption occurs silently?


Comment 11 Peter Wainwright 2007-01-26 02:04:57 UTC
Sorry, forgot to add that this is RHEL 4, so kernel
is Red Hat's 2.6.9-42.0.3.ELsmp.

Comment 12 Bill Dimm 2007-03-07 20:43:16 UTC
I've run into what seems to be the same bug, but my test results show some
differences that might be meaningful.  Quick summary:
1) iommu=soft boot parameter does fix the problem
2) Changing Memhole Map to disabled in BIOS does NOT fix the problem
3) Problem is very frequent when computer is freshly booted.  After several
passes, the problem goes away, or at least becomes so infrequent that my simple
test doesn't show it
4) Occasionally (a few percent of the time), computer hangs completely, early in
the test after a fresh boot.
5) Position of corruption is a multiple of 4096 characters into the file, and
corruption covers a whole 4096 character chunk.  Sometimes most bits are set to
1, and sometimes most are set to 0, but it is never purely 1 or 0.

64-bit OpenSUSE 10.2 (kernel
Motherboard: Tyan S2895
BIOS: Phoenix 1.03.2895
CPUs: 2x Opteron 270 (dual core)
Memory: 8GB - 4x 2GB Crucial DDR400
Drive: Seagate SATA - ST3400633AS

I created 6 identical files on disk.  Each was 2.1GB, so I assume they will each
need to be re-read from disk if I do "md5sum *" repeatedly, since there isn't
enough memory to cache them. The correct MD5 is known to be
7733a403dbe0de660362928f504cb9fc.  Here are some typical results after a fresh boot:

> md5sum *
3211a7fdf6f95d4b1b2dcd13623b89c3  file_1
a6dee9c115dbee14770ce031ce40c52e  file_2
e878762de270e4bc801004338ec05d60  file_3
928348176a8c51a89eb745bb39b2b4b2  file_4
4716a23d2645438c7e7764fc5493c07b  file_5
275551c09c2776fc9b2bb332a24ce54f  file_6

> md5sum *
f2ba6a50d586cf12e3ed6357ce04b121  file_1
49afcf87aca5fa923f7975dae687cfa5  file_2
7733a403dbe0de660362928f504cb9fc  file_3
7733a403dbe0de660362928f504cb9fc  file_4
9c81b97410bd8954010ae79979a7fe90  file_5
9cbef741d1af565a14f5f7c726801eff  file_6

> md5sum *
7733a403dbe0de660362928f504cb9fc  file_1
7733a403dbe0de660362928f504cb9fc  file_2
6f9fd3c073eef375b8da26be91465886  file_3
86f2141fbc523f6ededc128273e071e5  file_4
b3b20ec47d5cf70573aa207afe08004b  file_5
943864ecde3880a2be1e718162ba873a  file_6

> md5sum *
7733a403dbe0de660362928f504cb9fc  file_1
7733a403dbe0de660362928f504cb9fc  file_2
7733a403dbe0de660362928f504cb9fc  file_3
7733a403dbe0de660362928f504cb9fc  file_4
7733a403dbe0de660362928f504cb9fc  file_5
7733a403dbe0de660362928f504cb9fc  file_6

> md5sum *
7733a403dbe0de660362928f504cb9fc  file_1
7733a403dbe0de660362928f504cb9fc  file_2
7733a403dbe0de660362928f504cb9fc  file_3
7733a403dbe0de660362928f504cb9fc  file_4
7733a403dbe0de660362928f504cb9fc  file_5
7733a403dbe0de660362928f504cb9fc  file_6

Once it starts giving correct values for all files, it seems to continue doing
so until the next reboot, at least for the handful of iterations that I did.

Below is a sample "cmp -l" of two of the files after a fresh boot.  The files
are actually bzip2 files, so the bits are expected to look pretty random.  Both
files are expected to read from disk incorrectly after a fresh boot.  The
corrupted section should be the one where the bits don't look random.  In this
example, the first file has one corrupted 4096-character segment, and the second
file has two such segments.  I've used "..." to denote parts of the
4096-character section that I've omitted.

> cmp -l file_1 file_2
554016769 357 202
554016770 357  55
554016771 277 206
554016772 177  25
554016773 177 224
554016774 377 236
554016775 377  72
554016776 357 364
554016777 377 244
554016778 377  45
554016779 277 355
554020861 377 335
554020862 367 227
554020863 235 254
554020864 377 257
761208833 355 237
761208834  50 376
761208835 304 175
761208836 145 375
761208837  34 377
761208838 243 377
761208839  67 377
761208840 123 377
761212925 236 377
761212926 321 377
761212927 234 376
761212928  26 277
852713473 375 357
852713474  12 355
852713475 330 374
852713476 205 377
852713477 271 377
852713478  47 377
852713479 227 377
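Observation (5) above, that corruption starts at multiples of 4096 bytes and covers whole 4096-byte chunks, can be checked mechanically against `cmp -l` output like the listing shown. The following is a small illustrative helper, not part of the original report; note that `cmp -l` prints 1-based byte offsets.

```python
def corrupt_pages(cmp_offsets_1based, page=4096):
    """Map the 1-based byte offsets printed by `cmp -l` to the
    zero-based 4096-byte page numbers they fall in."""
    return sorted({(off - 1) // page for off in cmp_offsets_1based})

# First and last reported offsets of the first corrupted segment above:
offsets = [554016769, 554016770, 554016771, 554020864]

# All four mismatches fall within a single 4096-byte page...
print(corrupt_pages(offsets))

# ...and that segment starts exactly on a page boundary:
# 554016768 = 135258 * 4096, so offset 554016769 is byte 1 of page 135258.
print((554016769 - 1) % 4096 == 0)
```

The first and last offsets also span exactly 4096 bytes (554016769 through 554020864), consistent with a whole page being replaced.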
Comment 13 acquahmayer 2007-03-09 03:03:27 UTC
see http://www.uwsg.indiana.edu/hypermail/linux/kernel/0702.2/2069.html

Although we have mostly seen 4K blocks filled with nulls, occasionally
we have seen what seemed like random content (which would be explained
by a stale cache).

Also, the problem was volatile: on identical machines it sometimes
never showed up, but did repeatedly on others. After a reboot
this could change.

Now running with iommu=soft until resolved.
Comment 14 Filipus Klutiero 2007-03-12 14:50:22 UTC
With which version(s) are you experiencing this problem?
Comment 15 Dan Halbert 2007-03-21 14:05:12 UTC
We have also been seeing similar problems, on Tyan S3870 boards. These are also
dual Opteron boards (we have dual 265's), BUT they use the ServerWorks BCM-5785
(HT-1000) chipset. These machines all have 4x2GB DIMMs.

We see random 4k blocks in large files corrupted in the pagecache. Rebooting
clears the problem, so it's just in the cache, not on disk. It may take several
days for an error to show up while these machines are in production. The problem
may occur more quickly after a warm reboot. We have about 50 of these machines
now, and it occurs randomly across them.

We also more occasionally see ext3 journaling errors or other I/O error crashes.

We've tried various fixes, including booting "noapic" and various IOMMU and
memory hole BIOS options, all to no avail. Lowering the memory clock seems to
reduce the occurrence but does not eliminate it.

We will try "iommu=soft".

This is using the Centos 2.6.9-42.0.10.ELsmp x86_64 kernel.

Here's our thread about this, on the Centos list:
Comment 16 Christoph Anton Mitterer 2007-03-22 05:22:56 UTC
Created attachment 10908 [details]
Use sw-iommu per default on Nvidia boards

First of all, does anyone know if there were some advancements in Nvidia's and
AMD's research? Andy?

This is the patch as written by Steve Langasek from Debian.
Comment 17 Christoph Anton Mitterer 2007-03-22 05:25:45 UTC

First of all, does anyone know if there were some advancements in Nvidia's and
AMD's research? Andy?

Secondly: I've reported this bug to Debian as well, because I'd like to see it
"fixed" (well, I'd better say worked around) in etch.
See here for the bug report and all answers.

Now Steve Langasek has written a small patch that should disable the hw IOMMU for all
nvidia boards per default.

Please test it and report your results.

Andi, you said weeks ago that you'd think about using the sw IOMMU per default if we
didn't find a real solution soon.
I think the time has come. If someday Nvidia, AMD or somebody else finds a real
solution, we can use the hw IOMMU again.

I've attached the patch from Steve above, but perhaps we should also add a
reference to the bugzilla.kernel.org bug # in it.

Best wishes,
Comment 18 Joachim Deguara 2007-03-22 05:30:39 UTC
AMD and Nvidia are still actively investigating this problem.  Currently we are
doing hardware traces, which point to DMA happening after the OS thought that DMA
was finished and had thus cleaned up its GART mapping.  We are trying to pinpoint
why this is happening, but there is no solution or root cause as of yet.
Comment 19 Bill McGonigle 2007-03-22 08:06:13 UTC
I'm hoping the best solution is found in time for the next kernel release, but
if it's not we should err on the side of caution and use Steve's patch until a
better solution is found and tested.  

Best to give up some performance for data integrity.  I'm in the 'those who have
lost data' camp on this one.
Comment 20 Justin Finnerty 2007-03-28 03:44:54 UTC
Possible additional hardware risk from this issue: we have a set of AMD/nvidia
chipset machines installed in January 2007.  We have been observing the same
problems as already mentioned.  *Importantly*, we have had 3 disks report SMART
errors and 3 different disks that "disappear" on a running machine and then are
not seen during BIOS boot-up. As this is 6 disks out of 25, I am concerned that this
failure rate is not due to faulty disk hardware but that it may be due to this
issue.  Some details: half the nodes run Suse 10.2 (unmodified
default) and half run Suse 10.0 with a kernel compiled with the
patch from http://www.uwsg.indiana.edu/hypermail/linux/kernel/0702.2/
2069.html.  Drives are Barracuda 7200.10 SATA ST3320620AS.  The disk faults are
independent of kernel etc.
Comment 21 Joachim Deguara 2007-03-28 04:27:45 UTC
Justin, it would be really new if your SMART errors were somehow related,
although I think they are not.  What we are looking at appears to be a bad DMA
transaction that basically gets one address wrong when going through the GART.
This happens extremely rarely and appears to be very timing related, but should
only put the wrong 4kB of data on the disk and nothing worse (as if that were not
bad enough!).
AMD is actively working on this and, through hardware traces, is making progress
toward finding the root cause.  At this time no root cause has been established, but
more on this hopefully soon.
Comment 22 Andi Kleen 2007-03-29 08:46:09 UTC
SMART errors mean a broken disk. Nothing to see here, go away.
Comment 23 Arkadiusz Miskiewicz 2007-04-23 08:38:48 UTC
Nvidia posted a patch:

Comment 24 Dan Halbert 2007-04-23 13:44:59 UTC
re comment #23:

Actually, the patch is from AMD, and does not look Nvidia-specific. Here is a
shorter link to the lkml thread that is from groups.google.com instead of

Comment 25 Christoph Anton Mitterer 2007-04-23 15:02:42 UTC
Actually the patch was first posted here,... (probably after being sent to Andi):
Comment 26 Natalie Protasevich 2007-05-19 11:42:40 UTC
For all the parties affected by this bug and involved with testing: did the
patch resolve the issue? Any updates on the problem, testing results?
Comment 27 Joachim Deguara 2007-05-21 01:12:29 UTC
Thanks for the reminder. This bug has been solved from AMD's perspective: the
AGP specification did not permit a cached GART table, and our patch resolved
this. The patch went into 2.6.21 as commit
cf6387daf8858bdcb3e123034ca422e8979d73f1.  However, please all do test to confirm,
as *silent* data corruption is one of those things that keeps us
awake at night.
Andi please change the status of the bug.
Comment 28 Andi Kleen 2007-05-21 02:07:15 UTC
Fix merged in mainline
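Since the fix first shipped in 2.6.21, a quick sanity check is to compare the running kernel version against that release. An illustrative sketch follows; the version string here is a hard-coded example, not from this report, and on a real system it would come from `uname -r`.

```shell
#!/bin/sh
# Compare a kernel version string against 2.6.21, the first release
# containing commit cf6387daf8858bdcb3e123034ca422e8979d73f1.
kver=2.6.18      # example value; use `uname -r` on a real system
fixed=2.6.21
# sort -V orders version strings numerically; if $fixed sorts first,
# then $kver is at least as new as the fixed release
if [ "$(printf '%s\n%s\n' "$fixed" "$kver" | sort -V | head -n1)" = "$fixed" ]; then
    echo "kernel $kver includes the GART fix"
else
    echo "kernel $kver predates the fix; consider iommu=soft"
fi
```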
Comment 29 Eric Thiele 2007-08-21 02:40:37 UTC
The interesting thing is that I still have this kind of bug with an i386 kernel on top of 2 AMD 285s with 12 GB of memory; with less than 4 GB it runs fine. Memcheck did not find anything. The proposed patch was for x86_64 only. I get bad_pages and corruption with xfs writes under high load from 2.6.18 until now (2.6.22).

Comment 30 Joachim Deguara 2007-08-21 02:50:24 UTC
But if you are using i386, then the GART IOMMU is not used at all, and that was the root cause of this problem.  It sounds like you have another problem, as you cannot boot with iommu=soft to work around this on i386.  But we need to root-cause your problem.  How reproducible is it, and what steps are necessary to reproduce it?  Perhaps you should first open a second bug.