Bug 12084 - Silent data corruption on disk with nforce 630
Summary: Silent data corruption on disk with nforce 630
Status: REJECTED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 high
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-11-22 15:35 UTC by Stéphane Birot
Modified: 2008-11-29 02:17 UTC (History)
CC List: 0 users

See Also:
Kernel Version: 2.6.28-rc6
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmesg.log (62.11 KB, text/plain)
2008-11-22 17:19 UTC, Stéphane Birot

Description Stéphane Birot 2008-11-22 15:35:27 UTC
When copying files, the copy may differ from the original, and no error appears in dmesg. The problem showed up with Ubuntu 8.10 amd64 (kernel 2.6.27-7-generic), where there was so much corruption that I couldn't even install it from CD. On Ubuntu 8.10 i386, it happens less often. It doesn't happen with an older release (7.04, kernel 2.6.20.x).

So I ran some tests on Debian with self-compiled kernels, amd64 only. I tested kernel 2.6.28-rc6 and data corruption happened. The problem does not happen with kernel 2.6.20.21, but it does happen with kernel 2.6.21. I tested only on amd64 because the problem reproduces almost systematically there. The corruption shows up even with fairly small files (50 MB).

The installed devices are:
	- IDE drive (Western Digital 80 GB)
	- SATA drive (Maxtor 250 GB SATA II)
	- SATA DVD-RW

I tested with only the IDE drive and there was still data corruption. With only the two SATA devices, it also happened. It happens whether I copy from one disk to another, from a disk to itself, or from the CD to a disk. The corruption happens on disk write: if I repeat md5sum on the same file, it always reports the same value.
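
The copy-then-checksum test described above can be scripted so that it runs many rounds unattended. The following is only a minimal sketch, not the procedure the reporter actually used: the helper names `md5_of` and `copy_and_verify`, the `copy_N.bin` file naming, and the default of 5 rounds are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst_dir, rounds=5):
    """Copy src into dst_dir `rounds` times (hypothetical file names copy_N.bin).

    Returns the list of copies whose MD5 digest differs from the source.
    """
    expected = md5_of(src)
    bad = []
    for i in range(rounds):
        dst = Path(dst_dir) / f"copy_{i}.bin"
        shutil.copyfile(src, dst)
        if md5_of(dst) != expected:
            bad.append(dst)
    return bad
```

An empty result means every copy matched. Note that a copy read back immediately may be served from the page cache rather than from the disk, so a clean run says little about the write path unless the caches are dropped first (on Linux, for example, by writing 3 to /proc/sys/vm/drop_caches as root).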

On a 350 MB file, 58 bytes were different. Most of the time, only 1 bit differs between the two corresponding bytes.
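
Counts like these can be obtained by comparing the original and the copy byte by byte and XOR-ing each differing pair. This is a sketch under assumptions: the function names `diff_bytes` and `summarize` are hypothetical, and only the common length of the two files is compared.

```python
def diff_bytes(path_a, path_b, chunk_size=1 << 20):
    """Yield (offset, byte_a, byte_b) for each position where the files differ.

    Only the common length is compared; a size mismatch is not reported here.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            n = min(len(a), len(b))
            if n == 0:
                break
            # Skip the per-byte loop for chunks that are identical.
            if a[:n] != b[:n]:
                for i in range(n):
                    if a[i] != b[i]:
                        yield offset + i, a[i], b[i]
            offset += n


def summarize(path_a, path_b):
    """Return (differing_bytes, bytes_that_differ_in_exactly_one_bit)."""
    total = single_bit = 0
    for _, x, y in diff_bytes(path_a, path_b):
        total += 1
        # x ^ y has a 1 in every bit position where the two bytes disagree.
        if bin(x ^ y).count("1") == 1:
            single_bit += 1
    return total, single_bit
```

Calling `summarize(original, copy)` returns the two counts, and printing the offsets yielded by `diff_bytes` shows whether the damaged bytes cluster together or are spread across the file.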

The configuration is:
	- Asus M2N-VM DVI mainboard
	- One G-Skill DDR2 module of 2 GB

This PC is 1 month old. I tested the memory with memtest86 for 5 hours, and also with memtester, with no errors. I flashed the BIOS to the latest version. There is no data corruption in Windows XP when copying the same files.
Comment 1 Tejun Heo 2008-11-22 16:45:51 UTC
If it's happening on both PATA and SATA when copying to themselves, it's unlikely to be a specific driver problem. Bit flipping strongly indicates that the corruption happens while the data is in transit on some bus or in memory, as opposed to logic problems such as overwriting an in-use page or using the wrong page.
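
One rough way to tell the two failure patterns apart is to count differing bytes per page: a wrong or overwritten page tends to differ across much of its length, while bit flips leave a few isolated bytes. The sketch below is a heuristic for illustration only. The 4 KiB page size, the 25% threshold, and the names `differing_bytes_per_page` and `classify` are all assumptions, and it works on data already read into memory, so it suits excerpts rather than whole large files.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual size on x86


def differing_bytes_per_page(data_a, data_b, page_size=PAGE_SIZE):
    """Map page index -> number of differing bytes within that page."""
    counts = Counter()
    for i, (x, y) in enumerate(zip(data_a, data_b)):
        if x != y:
            counts[i // page_size] += 1
    return counts


def classify(data_a, data_b, page_size=PAGE_SIZE, threshold=0.25):
    """Rough heuristic returning 'none', 'scattered' or 'page-sized'.

    'page-sized' means at least one page differs in `threshold` or more of
    its bytes, which hints at a logic problem (wrong or overwritten page).
    'scattered' means only isolated bytes differ, which hints at bit flips.
    The threshold is an arbitrary assumption, not a calibrated value.
    """
    counts = differing_bytes_per_page(data_a, data_b, page_size)
    if not counts:
        return "none"
    if any(n >= threshold * page_size for n in counts.values()):
        return "page-sized"
    return "scattered"
```

A result of "scattered" is consistent with the bit-flip explanation but does not prove it; it only rules out the most obvious whole-page mix-ups.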

Can you please post the kernel boot log and the output of "lspci -nn"?  Also, is it possible to get another memory stick and test whether the problem persists with a different module?  Please note that memtest86 is conclusive when it reports failures, but a clean run doesn't necessarily mean there is no problem.  Memory corruption can depend on a lot of things, such as the CPU and DMA trying to access nearby regions concurrently.
Comment 2 Stéphane Birot 2008-11-22 17:19:11 UTC
Created attachment 18975
dmesg.log
Comment 3 Stéphane Birot 2008-11-22 17:43:59 UTC
I have only one DDR2 module so I can't test with another one right now. I will see if I can get one.

Also, I ran my tests using the SATA drivers (the IDE drive appears as /dev/sdb). The corruption also happened with the stock Debian kernel, where the IDE drive appears as /dev/hda.

--- lspci -------------------

00:00.0 RAM memory [0500]: nVidia Corporation MCP67 Memory Controller [10de:0547] (rev a2)
00:01.0 ISA bridge [0601]: nVidia Corporation MCP67 ISA Bridge [10de:0548] (rev a2)
00:01.1 SMBus [0c05]: nVidia Corporation MCP67 SMBus [10de:0542] (rev a2)
00:02.0 USB Controller [0c03]: nVidia Corporation MCP67 OHCI USB 1.1 Controller [10de:055e] (rev a2)
00:02.1 USB Controller [0c03]: nVidia Corporation MCP67 EHCI USB 2.0 Controller [10de:055f] (rev a2)
00:04.0 USB Controller [0c03]: nVidia Corporation MCP67 OHCI USB 1.1 Controller [10de:055e] (rev a2)
00:04.1 USB Controller [0c03]: nVidia Corporation MCP67 EHCI USB 2.0 Controller [10de:055f] (rev a2)
00:06.0 IDE interface [0101]: nVidia Corporation MCP67 IDE Controller [10de:0560] (rev a1)
00:07.0 Audio device [0403]: nVidia Corporation MCP67 High Definition Audio [10de:055c] (rev a1)
00:08.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Bridge [10de:0561] (rev a2)
00:09.0 SATA controller [0106]: nVidia Corporation MCP67 AHCI Controller [10de:0554] (rev a2)
00:0a.0 Ethernet controller [0200]: nVidia Corporation MCP67 Ethernet [10de:054c] (rev a2)
00:0b.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0562] (rev a2)
00:0c.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:0d.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:0e.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:0f.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:10.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:11.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:12.0 VGA compatible controller [0300]: nVidia Corporation GeForce 7050 PV / nForce 630a [10de:053b] (rev a2)
00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration [1022:1100]
00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map [1022:1101]
00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller [1022:1102]
00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control [1022:1103]

--- /proc/ioports -------------------

0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-0060 : keyboard
0064-0064 : keyboard
0070-0071 : rtc0
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : 0000:00:06.0
  0170-0177 : pata_amd
01f0-01f7 : 0000:00:06.0
  01f0-01f7 : pata_amd
0230-023f : pnp 00:09
0290-029f : pnp 00:09
0376-0376 : 0000:00:06.0
  0376-0376 : pata_amd
03c0-03df : vga+
03f6-03f6 : 0000:00:06.0
  03f6-03f6 : pata_amd
03f8-03ff : serial
04d0-04d1 : pnp 00:05
0500-057f : pnp 00:05
  0500-0503 : ACPI PM1a_EVT_BLK
  0504-0505 : ACPI PM1a_CNT_BLK
  0508-050b : ACPI PM_TMR
  0510-0515 : ACPI CPU throttle
  0520-0527 : ACPI GPE0_BLK
0580-05ff : pnp 00:05
0600-063f : 0000:00:01.1
0700-073f : 0000:00:01.1
0800-080f : pnp 00:05
0880-08ff : pnp 00:05
  08a0-08af : ACPI GPE1_BLK
0900-09ff : 0000:00:01.0
0a00-0a0f : pnp 00:09
0a10-0a1f : pnp 00:09
0cf8-0cff : PCI conf1
0d00-0d7f : pnp 00:05
0d80-0dff : pnp 00:05
1100-117f : pnp 00:05
1180-11ff : pnp 00:05
d880-d887 : 0000:00:0a.0
  d880-d887 : forcedeth
dc00-dc0f : 0000:00:09.0
  dc00-dc0f : ahci
e000-e003 : 0000:00:09.0
  e000-e003 : ahci
e080-e087 : 0000:00:09.0
  e080-e087 : ahci
e400-e403 : 0000:00:09.0
  e400-e403 : ahci
e480-e487 : 0000:00:09.0
  e480-e487 : ahci
ec00-ec3f : 0000:00:01.1
ffa0-ffaf : 0000:00:06.0
  ffa0-ffaf : pata_amd
Comment 4 Stéphane Birot 2008-11-28 16:32:19 UTC
This turned out to be due to a faulty memory module. Memtest86 finally showed errors after I ran it again for hours.

Sorry for the fake bug.
Comment 5 Tejun Heo 2008-11-29 02:17:16 UTC
Heh.. thanks for finding out the actual problem.  It's a real relief.  :-)  Marking INVALID.
Comment 6 Tejun Heo 2008-11-29 02:17:25 UTC
Marking...
