Hi. This bug is mainly discussed on lkml, and for in-depth information you should probably read the whole thread(s) there. You can find the archive here: http://marc.theaimsgroup.com/?t=116502121800001&r=1&w=2

Now a short description:

- I (and many others) found a data corruption issue that happens on AMD Opteron / Nvidia chipset systems.

- What happens: if one reads/writes large amounts of data, there are errors. We test this the following way: create some (huge amounts of) test data, make md5sums of it (or use other hash algorithms), then verify them over and over. The test shows differences (refer to the lkml thread for more information about this), always in different files (!!!!). It may happen on both read AND write access. Note that even for affected users the error occurs rarely (but this is of course still far too often). My personal tests show roughly the following: test data is 30GB (of random data); I verify the sha512sums 50 times (that is what I call one complete test), so I verify 30*50 GB. In one complete test there are about 1-3 files with differences, each with about 100 corrupted bytes (in any case very small amounts, far below a MB).

- It probably happens with all the nForce chipsets (see the lkml thread, where everybody lists their hardware).

- The reason is not individual hardware defects (dozens of high-quality memory, CPU, PCI bus, HDD bad-block, PCI parity, ECC, etc. tests showed this, and even with different hardware components the issue remained).

- It is probably not an operating-system-related bug, although Windows does not suffer from it. The reason for that is that Windows is (too stupid)... I mean, unable to use the hardware IOMMU at all.

- It happens with both PATA and SATA disks. To be exact: it may be that this has nothing specifically to do with hard disks at all; it is probably PCI-DMA related (see lkml for more details and the reasons for this thesis).

- Only users with a lot of main memory (I don't know the exact value by heart and I'm too lazy to look it up)... say 4GB, will suffer from this problem. Why? Only users who need the memory hole mapping and the IOMMU will suffer from it (this is why we think it is chipset related).

- We found two "workarounds", but both have big problems:

Workaround 1: disable memory hole mapping in the system BIOS entirely. The issue no longer occurs, BUT you lose a big part of your main memory (depending on the size of the memhole, which itself depends on the PCI devices). In my case I lose 1.5GB of my 4GB; most users will probably lose 1GB. => unacceptable

Workaround 2: as mentioned, Windows does not suffer from the problem because it always uses a software IOMMU (btw: the same applies to Intel CPUs with EM64T/Intel 64; these CPUs don't even have a hardware IOMMU). Linux is able to use the hardware IOMMU (which of course speeds up the whole system). If you tell the (Linux) kernel to use a software IOMMU (with the kernel parameter iommu=soft), the issue does not appear. => This is better than workaround 1 but still not really acceptable. Why? There are some follow-on problems: the hardware IOMMU and systems with this much main memory are largely used in computing centres. Those groups won't give up the hardware IOMMU in general simply because some Opteron (and perhaps Athlon) / Nvidia combinations cause problems.
(I can tell this because I work at the Leibniz Supercomputing Centre, one of the largest in Europe.) But as we don't know the exact reason for the issue, we cannot selectively switch to iommu=soft for affected mainboards/chipsets/CPU steppings and the like; we'd have to use a kernel-wide iommu=soft as a catch-all solution. But it is highly unlikely that this would be accepted by the Linux community (not to mention end users like the supercomputing centres), and I don't want to talk about other OSes. Perhaps this might be solvable via BIOS fixes, but of course not by the stupid solution "disable the hardware IOMMU via the BIOS". Perhaps the reason is a Linux kernel bug (although this is highly unlikely). Last but not least, perhaps this is an AMD Opteron/Athlon issue (note: these CPUs have the memory controller integrated directly) and/or an Nvidia nForce chipset issue. Regards, Chris. PS: Please post any other resources/links to threads about this or similar problems.
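To make the test procedure described above concrete, here is a minimal sketch of the kind of verification loop we run (the directory, file sizes and pass count are just examples, not the exact script anyone posted on lkml):

#!/bin/sh
# Rough sketch of the "verify checksums over and over" test described above.
# Assumes ~30GB of free space under /data/corruptiontest; adjust to taste.
set -e
DIR=/data/corruptiontest
mkdir -p "$DIR"

# Create the test data once: 30 x 1GB files of random data.
for i in $(seq 1 30); do
    [ -f "$DIR/file_$i" ] || dd if=/dev/urandom of="$DIR/file_$i" bs=1M count=1024
done

# Record the reference checksums.
( cd "$DIR" && sha512sum file_* > SHA512SUMS )

# One "complete test" = verify the whole set 50 times; any FAILED line is a hit.
for pass in $(seq 1 50); do
    echo "=== pass $pass ==="
    ( cd "$DIR" && sha512sum -c SHA512SUMS | grep -v ': OK$' || true )
done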
A recent update to this:

From: Andi Kleen <ak@suse.de>
Date: Wed, 17 Jan 2007 08:29:53 +1100
Subject: Re: data corruption with nvidia chipsets and IDE/SATA drives (k8 cpu errata needed?)
Message-Id: <200701170829.54540.ak@suse.de>

[...] AMD is looking at the issue. Only Nvidia chipsets seem to be affected, although there were similar problems on VIA in the past too. Unless a good workaround comes around soon I'll probably default to iommu=soft on Nvidia.
I too am seeing this problem on AM2 Athlon X2 ASUS boards. However, it is not a problem with 939-pin Athlon X2 boards; both run nVidia chipsets. I tested this on several boards and it's consistent, so I'm convinced it's not my hardware that's bad. As discussed here, it ONLY happens with 4GB of RAM. It also seems to be triggered by disk I/O, generally running large rsync jobs. And iommu=soft fixes it.
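For anyone who just wants to try the workaround mentioned above: iommu=soft is a kernel boot parameter, so it goes on the kernel command line. A minimal example for GRUB legacy follows; the file name, kernel version and other options shown are assumptions, so adapt them to your distribution and bootloader:

# /boot/grub/menu.lst (GRUB legacy): append iommu=soft to the kernel line, e.g.
#   kernel /boot/vmlinuz-2.6.18.2-34-default root=/dev/sda2 splash=silent iommu=soft
#
# After rebooting, check the boot messages to confirm that the software
# IOMMU (swiotlb bounce buffering) is in use instead of the GART IOMMU:
dmesg | grep -i -e iommu -e swiotlb -e 'PCI-DMA'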
Marc and Chris, do your boards both have the Nvidia CK804 SATA controller?
Note that this problem is visible not only with SATA. In my case I was using a QLogic 2312 Fibre Channel card and a (not used, but installed) Adaptec controller in the TARO slot. No SATA hard disks. 8GB of RAM.
http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/b8bdbde9721f7d35/a3bd93dded549eca
Example corruption diffs: http://ep09.pld-linux.org/~arekm/qlogic/

# lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
08:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
08:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
08:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
08:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
09:08.0 Fibre Channel: QLogic Corp. QLA2312 Fibre Channel Adapter (rev 02)
0a:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
0a:09.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)

The Adaptec controller was removed and that somehow cured the problem. The IOMMU is used:

PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ d8000000 size 65536 KB
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
Arkadiusz, from your diffs it looks like you are getting single-bit flips, which do not appear to be related to this bug. Does it go away when you set "iommu=soft"? It may just be a hardware issue, as you mentioned that removing a different card from your system made the problem go away (perhaps the PSU was stressed???).
>Marc and Chris, do your boards both have the Nvidia CK804 SATA controller?

00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)

Yes, I have one.
Joachim: I cannot test iommu=soft, unfortunately. The machines have been in production for a few months now. The problem was visible with this combination:

problematic: QLogic card + Adaptec in the TARO slot (both PCI-X, but the TARO slot has a proprietary connector)
working fine: LSI MegaRAID SCSI PCI-E + Adaptec in the TARO slot

Maybe it's a different bug... maybe not.
Arkadiusz, could you post the specific models of the QLogic, Adaptec & LSI cards that you're using? Thanks.
Intel SRCU42E RAID controller, PCI-E (LSI MegaRAID); the driver reports:
megaraid: fw version:[514P] bios version:[H431]
http://www.intel.com/design/servers/RAID/srcu42e/

QLogic:
scsi0 : qla2xxx
qla2300 0000:09:08.0: QLogic Fibre Channel HBA Driver: 8.01.04-k-fw
QLogic QLA2340 - ISP2312: PCI-X (133 MHz) @ 0000:09:08.0 hdma+, host#=0, fw=3.03.18 IPX

Adaptec: Adaptec AIC-7901 U320 (rev 10) in the TARO slot: http://www.tyan.com/products/html/m7901.html
scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
<Adaptec AIC7901 Ultra320 SCSI adapter>
aic7901: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs

The platform is a Tyan GT24 B2891 (http://www.tyan.com/products/html/gt24b2891.html) with a Tyan Thunder K8SRE mainboard, two dual-core Opteron 270s and 6GB of RAM.
This may be a separate bug, but in case it is related, here goes: I also suffer from data corruption when writing large files. I have 8GB (of which >5GB is presently in use by a large, long-running computational electromagnetics program, so I'm definitely using high memory). This is on an HP xw9300 with dual Opteron 285, Nvidia chipset including the CK804 SATA controller. However, I also get these messages:

Jan 26 09:58:31 tux kernel: PCI-DMA: Out of IOMMU space for 524288 bytes at device 0000:00:07.0
Jan 26 09:58:31 tux kernel: ata1: status=0x50 { DriveReady SeekComplete }

whereas, if I understand it correctly, your corruption occurs silently? Peter
Sorry, forgot to add that this is RHEL 4, so the kernel is Red Hat's 2.6.9-42.0.3.ELsmp. Peter
I've run into what seems to be the same bug, but my test results show some differences that might be meaningful. Quick summary:

1) The iommu=soft boot parameter does fix the problem.
2) Changing Memhole Map to disabled in the BIOS does NOT fix the problem.
3) The problem is very frequent when the computer is freshly booted. After several passes, the problem goes away, or at least becomes so infrequent that my simple test doesn't show it.
4) Occasionally (a few percent of the time), the computer hangs completely, early in the test after a fresh boot.
5) The position of the corruption is a multiple of 4096 characters into the file, and the corruption covers a whole 4096-character chunk. Sometimes most bits are set to 1, and sometimes most are set to 0, but it is never purely 1 or 0.

System: 64-bit OpenSUSE 10.2 (kernel 2.6.18.2-34)
Motherboard: Tyan S2895
BIOS: Phoenix 1.03.2895
CPUs: 2x Opteron 270 (dual core)
Memory: 8GB - 4x 2GB Crucial DDR400
Drive: Seagate SATA - ST3400633AS

I created 6 identical files on disk. Each was 2.1GB, so I assume they will each need to be re-read from disk if I do "md5sum *" repeatedly, since there isn't enough memory to cache them. The correct MD5 is known to be 7733a403dbe0de660362928f504cb9fc. Here are some typical results after a fresh boot:

> md5sum *
3211a7fdf6f95d4b1b2dcd13623b89c3  file_1
a6dee9c115dbee14770ce031ce40c52e  file_2
e878762de270e4bc801004338ec05d60  file_3
928348176a8c51a89eb745bb39b2b4b2  file_4
4716a23d2645438c7e7764fc5493c07b  file_5
275551c09c2776fc9b2bb332a24ce54f  file_6
> md5sum *
f2ba6a50d586cf12e3ed6357ce04b121  file_1
49afcf87aca5fa923f7975dae687cfa5  file_2
7733a403dbe0de660362928f504cb9fc  file_3
7733a403dbe0de660362928f504cb9fc  file_4
9c81b97410bd8954010ae79979a7fe90  file_5
9cbef741d1af565a14f5f7c726801eff  file_6
> md5sum *
7733a403dbe0de660362928f504cb9fc  file_1
7733a403dbe0de660362928f504cb9fc  file_2
6f9fd3c073eef375b8da26be91465886  file_3
86f2141fbc523f6ededc128273e071e5  file_4
b3b20ec47d5cf70573aa207afe08004b  file_5
943864ecde3880a2be1e718162ba873a  file_6
> md5sum *
7733a403dbe0de660362928f504cb9fc  file_1
7733a403dbe0de660362928f504cb9fc  file_2
7733a403dbe0de660362928f504cb9fc  file_3
7733a403dbe0de660362928f504cb9fc  file_4
7733a403dbe0de660362928f504cb9fc  file_5
7733a403dbe0de660362928f504cb9fc  file_6
> md5sum *
7733a403dbe0de660362928f504cb9fc  file_1
7733a403dbe0de660362928f504cb9fc  file_2
7733a403dbe0de660362928f504cb9fc  file_3
7733a403dbe0de660362928f504cb9fc  file_4
7733a403dbe0de660362928f504cb9fc  file_5
7733a403dbe0de660362928f504cb9fc  file_6

Once it starts giving correct values for all files, it seems to continue doing so until the next reboot, at least for the handful of iterations that I did manually.

Below is a sample "cmp -l" of two of the files after a fresh boot. The files are actually bzip2 files, so the bits are expected to look pretty random. Both files are expected to read incorrectly from disk after a fresh boot. The corrupted section should be the one where the bits don't look random. In this example, the first file has one corrupted 4096-character segment, and the second file has two such segments. I've used "..." to denote parts of the 4096-character sections that I've omitted.

> cmp -l file_1 file_2
554016769 357 202
554016770 357 55
554016771 277 206
554016772 177 25
554016773 177 224
554016774 377 236
554016775 377 72
554016776 357 364
554016777 377 244
554016778 377 45
554016779 277 355
...
554020861 377 335
554020862 367 227
554020863 235 254
554020864 377 257
761208833 355 237
761208834 50 376
761208835 304 175
761208836 145 375
761208837 34 377
761208838 243 377
761208839 67 377
761208840 123 377
...
761212925 236 377
761212926 321 377
761212927 234 376
761212928 26 277
852713473 375 357
852713474 12 355
852713475 330 374
852713476 205 377
852713477 271 377
852713478 47 377
852713479 227 377
...
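To double-check the "multiple of 4096" observation, the 1-based byte offsets printed by "cmp -l" can be reduced to 4KiB block numbers; here is a quick sketch (the file names are the ones from the example above):

# Group corrupted byte offsets from "cmp -l" into 4KiB blocks.
# cmp -l prints 1-based offsets, so subtract 1 before dividing.
cmp -l file_1 file_2 | awk '{ print int(($1 - 1) / 4096) }' | uniq -c
# Output: one line per corrupted 4KiB block, showing how many bytes differ
# in that block and the block index (per the observation above, the corruption
# starts on a 4096-byte boundary and covers a whole 4096-byte chunk).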
See http://www.uwsg.indiana.edu/hypermail/linux/kernel/0702.2/2069.html. Although we have mostly seen 4K blocks filled with nulls, we have occasionally seen what looked like random content (which would be explained by a stale cache). The problem was also volatile: on identical machines it sometimes never showed up, but showed up repeatedly on others, and after a reboot this could change. Now running with iommu=soft until this is resolved.
With which version(s) are you experiencing this problem?
We have also been seeing similar problems, on Tyan S3870 boards. These are also dual-Opteron boards (we have dual 265s), BUT they use the ServerWorks BCM-5785 (HT-1000) chipset. These machines all have 4x 2GB DIMMs. We see random 4k blocks in large files corrupted in the page cache. Rebooting clears the problem, so it's just in the cache, not on disk. It may take several days for an error to show up while these machines are in production; the problem may occur more quickly after a warm reboot. We have about 50 of these machines now, and it occurs randomly across them. We also, more occasionally, see ext3 journaling errors or other I/O error crashes. We've tried various fixes, including booting with "noapic" and various IOMMU and memory hole BIOS options, all to no avail. Lowering the memory clock seems to reduce the occurrence but does not eliminate it. We will try "iommu=soft". This is using the CentOS 2.6.9-42.0.10.ELsmp x86_64 kernel. Here's our thread about this, on the CentOS list: http://lists.centos.org/pipermail/centos/2007-March/076103.html
Created attachment 10908 [details]
Use sw-iommu per default on Nvidia boards

This is the patch as written by Steve Langasek from Debian.
Hi. First of all, does anyone know whether there have been any advances in Nvidia's and AMD's research? Andi? Secondly: I've reported this bug to Debian as well, because I'd like to see it "fixed" (well, I'd better say worked around) in etch. See here for the bug report and all answers: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=404148 Now Steve Langasek has written a small patch that should disable the hardware IOMMU by default on all Nvidia boards. Please test it and report your results. Andi, you said weeks ago that you'd think about using the software IOMMU by default if we don't find a real solution soon; I think the time has come. If someday Nvidia, AMD or somebody else finds a real solution, we could use the hardware IOMMU again. I've attached the patch from Steve above, but perhaps we should also add a reference to the bugzilla.kernel.org bug # in it. Best wishes, Chris.
AMD and Nvidia are still actively investigating this problem. Currently we are doing hardware traces, which point to DMA happening after the OS thought the DMA was finished and had therefore cleaned up its GART mapping. We are trying to pinpoint why this is happening, but there is no solution or root cause as of yet.
I'm hoping the best solution is found in time for the next kernel release, but if it's not we should err on the side of caution and use Steve's patch until a better solution is found and tested. Best to give up some performance for data integrity. I'm in the 'those who have lost data' camp on this one.
Possible additional hardware risk from this issue: we have a set of AMD/Nvidia chipset machines installed in January 2007, and we have been observing the same problems as already mentioned. *Importantly*, we have had 3 disks report SMART errors and 3 different disks that "disappear" on a running machine and are then not seen during BIOS boot-up. As this is 6 disks out of 25, I am concerned that this failure rate is not due to faulty disk hardware but may be due to this issue. Some details: half the nodes run SUSE 10.2 (unmodified 2.6.18.2-34-default) and half run SUSE 10.0 with a 2.6.20.2-smp kernel compiled with the patch from http://www.uwsg.indiana.edu/hypermail/linux/kernel/0702.2/2069.html. The drives are Barracuda 7200.10 SATA ST3320620AS. The disk faults are independent of the kernel etc.
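As a side note on the disk failures: whether a drive really has hardware trouble can be checked independently of this bug with smartmontools; a quick sketch (the device name is just an example):

# Reallocated/pending sector counts and the drive's own error log point at
# genuine disk hardware problems rather than DMA-level corruption.
smartctl -a /dev/sda | grep -i -e reallocated -e pending -e error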
Justin, this would be really new if your SMART errors were somehow related, although I think they are not. What we are looking at appears to be a bad DMA transaction that basically gets one address wrong when going through the GART. This happens extremely rarely and appears to be very timing-related, but it should only put the wrong 4kB of data on the disk and nothing worse (as if that were not bad enough!). AMD is actively working on this and, through hardware traces, is in the process of finding the root cause. At this time no root cause has been established, but hopefully more on this soon.
SMART errors mean a broken disk. Nothing to see here, go away.
Nvidia posted a patch: http://groups.google.pl/group/linux.kernel/browse_thread/thread/4783711fa3d7592/dd12faec9a78f874
re comment #23: Actually, the patch is from AMD, and does not look Nvidia-specific. Here is a shorter link to the lkml thread that is from groups.google.com instead of groups.google.pl: http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/51f41bf92fd8a948/
Actually, the patch was first posted here (probably after being sent to Andi): http://www.nvnews.net/vbulletin/showthread.php?p=1223917#post1223917
For all the parties affected by this bug and involved with testing: did the patch resolve the issue? Any updates on the problem, testing results? Thanks.
Thanks for the reminder. This bug has been solved from AMD's perspective: the AGP specification does not permit a cached GART table, and our patch resolved this. The patch went into 2.6.21 as commit cf6387daf8858bdcb3e123034ca422e8979d73f1. However, please do all keep testing to confirm, since *silent* data corruption is one of those things that keeps us awake at night. Andi, please change the status of the bug.
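For anyone wondering whether a given kernel already contains that fix, a quick check (assuming a local clone of the mainline kernel git tree; the commit ID is the one quoted above):

# In a mainline kernel git tree: which release first contains the GART fix?
git describe --contains cf6387daf8858bdcb3e123034ca422e8979d73f1
# This is expected to print a v2.6.21 tag; kernels older than 2.6.21 (without
# a distro backport of the patch) should keep booting with iommu=soft.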
Fix merged in mainline
The interesting thing is that I still have this kind of bug with an i386 kernel on top of two AMD 285s with 12GB of memory; with less than 4GB it runs fine. Memcheck did not find anything. The proposed patch was for x86_64 only. I get bad_pages and corruption with xfs_writes under high load, from 2.6.18 until now, 2.6.22. Greets, Eric
But if you are using i386 then the GART IOMMU is not used at all, and that was the root cause of this problem. It sounds like you have another problem, since you cannot boot with iommu=soft to work around this on i386. But we need to find the root cause of your problem: how reproducible is it, and what steps are necessary to reproduce it? Perhaps you should first open a second bug.