Most recent kernel where this bug did not occur: 2.6.15-1
Distribution: Debian Testing
Hardware Environment: SuperMicro X6DVL-EG2, 4 GB RAM, 3ware 8006-2LP, 2xHitachi 2x80 GB
Software Environment:
Problem Description: Data corruption with kernels newer than 2.6.15-1.

Steps to reproduce: tested with most 2.6.16, 2.6.17 and 2.6.18 kernels, all showing the same corruption.

Solved the problem by leaving only 1 GB of RAM and setting IOMMU=off at boot time; with 4 GB of RAM the kernel hangs, saying:

3w-xxxx: tw_map_scsi_sg_data(): pci_map_sg() failed.
nommu_map_sg: overflow 2053d9000+4096 of device mask ffffffff
3w-xxxx: tw_map_scsi_sg_data(): pci_map_sg() failed.
nommu_map_sg: overflow 2053d9000+4096 of device mask ffffffff
3w-xxxx: tw_map_scsi_sg_data(): pci_map_sg() failed.
nommu_map_sg: overflow 2053d9000+4096 of device mask ffffffff

and hangs badly.
On Mon, 2 Oct 2006 08:08:37 -0700 bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=7246
>
> Summary: 3w-xxxx, IOMMU and >1go RAM
> Kernel Version: 2.6.18
> Status: NEW
> Severity: normal
> Owner: scsi_drivers-other@kernel-bugs.osdl.org
> Submitter: aarnoud@agematis.com
>
> [...]

James, Adam: I recall that we had a scsi driver recently which had problems similar to this, but I think it was with addresses over 4G, not over 1G. It also might not have been the 3ware driver. Do you recall?

Alexandre, can you please send the full dmesg output for that machine?

Thanks.
This driver is for the older 3ware controllers. They can only DMA to 32-bit addresses. I set pci_set_dma_mask(pdev, DMA_32BIT_MASK). When you have >= 4GB of RAM, you should use the IOMMU. The driver has a large default queue depth, so it can use up a lot of mappings. Perhaps the IOMMU is turned off in the BIOS, or the aperture isn't large enough for the mappings?

-Adam
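To make the "overflow 2053d9000+4096 of device mask ffffffff" message concrete, here is a minimal user-space sketch of the kind of range check a 32-bit DMA mask implies. It is not the kernel's code: `sketch_dma_addr_ok` and `SKETCH_DMA_32BIT_MASK` are hypothetical names invented for this illustration, and the exact comparison used by the kernel may differ.

```c
#include <stdint.h>

/* Assumed stand-in for a 32-bit DMA mask: only the low 4 GiB is reachable. */
#define SKETCH_DMA_32BIT_MASK 0xffffffffULL

/*
 * Hypothetical helper (not a kernel function).
 * Returns 1 if every byte of the buffer [bus, bus + size) lies at or
 * below `mask`, 0 otherwise. Arithmetic wrap-around for addresses near
 * 2^64 is not handled in this sketch.
 */
static int sketch_dma_addr_ok(uint64_t bus, uint64_t size, uint64_t mask)
{
    if (size == 0)
        return 1;                  /* an empty buffer is trivially reachable */
    return bus + size - 1 <= mask; /* compare the address of the last byte */
}
```

With the values from the log, `sketch_dma_addr_ok(0x2053d9000ULL, 4096, SKETCH_DMA_32BIT_MASK)` is 0: the page sits above the 4 GiB boundary, so a device limited to 32-bit addresses cannot reach it without an IOMMU or a bounce buffer.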
> > > 3w-xxxx: tw_map_scsi_sg_data(): pci_map_sg() failed.
> > > nommu_map_sg: overflow 2053d9000+4096 of device mask ffffffff
> > > 3w-xxxx: tw_map_scsi_sg_data(): pci_map_sg() failed.
> > > nommu_map_sg: overflow 2053d9000+4096 of device mask ffffffff
> > > 3w-xxxx: tw_map_scsi_sg_data(): pci_map_sg() failed.
> > > nommu_map_sg: overflow 2053d9000+4096 of device mask ffffffff

nommu_* means he didn't compile in the IOMMU code.

Operator error -> invalid.

-Andi
Reply-To: James.Bottomley@SteelEye.com

On Mon, 2006-10-02 at 20:17 +0200, Andi Kleen wrote:
> nommu_* means he didn't compile in the IOMMU code.
>
> Operator error -> invalid.

But how did we get sg segments over the mask? The block layer should have bounced them, I think.

James
On Monday 02 October 2006 20:24, James Bottomley wrote:
> On Mon, 2006-10-02 at 20:17 +0200, Andi Kleen wrote:
> > nommu_* means he didn't compile in the IOMMU code.
> >
> > Operator error -> invalid.
>
> But how did we get sg segments over the mask? The block layer should
> have bounced them, I think.

Yes, but it can't when the bounce code is not compiled in. In this case it just printks and you saw those printks.

-Andi
Reply-To: James.Bottomley@SteelEye.com

On Mon, 2006-10-02 at 20:35 +0200, Andi Kleen wrote:
> Yes, but it can't when the bounce code is not compiled in.
> In this case it just printks and you saw those printks.

That's this bit of subversion in ll_rw_blk.c:blk_queue_bounce_limit()

#if BITS_PER_LONG == 64
	/* Assume anything <= 4GB can be handled by IOMMU.
	   Actually some IOMMUs can handle everything, but I don't
	   know of a way to test this here. */
	if (bounce_pfn < (min_t(u64,0xffffffff,BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
		dma = 1;
	q->bounce_pfn = max_low_pfn;
#else

?

That really looks wrong ... it will fail, as we've seen for 64 bit platforms with no iommu. We should init the isa pool in that case.

James
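The effect of the quoted branch can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it reproduces just the lines quoted above, it assumes 4 KiB pages, it passes `BLK_BOUNCE_HIGH` in as a plain parameter, and it assumes that a page is bounced only when its page frame number exceeds the queue's `bounce_pfn`. All `sketch_*` names are hypothetical and do not exist in the kernel.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Models the two values the quoted branch decides on. */
struct sketch_queue {
    uint64_t bounce_pfn; /* pages above this pfn would be bounced */
    int dma;             /* models the local `dma` flag in the quote */
};

static uint64_t sketch_min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Models only the BITS_PER_LONG == 64 lines quoted above.
 * dma_addr:    highest address the device can reach (its DMA mask)
 * max_low_pfn: highest low-memory page frame
 * bounce_high: stand-in for BLK_BOUNCE_HIGH (an assumption of this sketch)
 */
static struct sketch_queue sketch_bounce_limit_64(uint64_t dma_addr,
                                                  uint64_t max_low_pfn,
                                                  uint64_t bounce_high)
{
    struct sketch_queue q;
    uint64_t bounce_pfn = dma_addr >> SKETCH_PAGE_SHIFT;

    q.dma = 0;
    if (bounce_pfn < (sketch_min_u64(0xffffffffULL, bounce_high) >> SKETCH_PAGE_SHIFT))
        q.dma = 1;
    q.bounce_pfn = max_low_pfn;
    return q;
}

/* Assumed bounce rule: only pages above the queue limit are bounced. */
static int sketch_needs_bounce(struct sketch_queue q, uint64_t pfn)
{
    return pfn > q.bounce_pfn;
}
```

For a device mask of 0xffffffff the comparison is 0xfffff < 0xfffff, so `dma` stays 0 and the limit becomes `max_low_pfn`. If `max_low_pfn` covers all of RAM, as this sketch assumes for a 64-bit machine, a page such as pfn 0x2053d9 (the address in the log) is never bounced even though the device cannot reach it. That appears to be the case James is pointing at.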
> That really looks wrong ... it will fail, as we've seen for 64 bit
> platforms with no iommu.

That is just an operator error. It's like compiling your kernel without the driver for your root file system. Yes, CONFIG_* gives you plenty of rope ...

> We should init the isa pool in that case.

No.

-Andi
Created attachment 9148 [details] .config for 2.6.18
Created attachment 9149 [details] dmesg of 2.6.18
Additional information: Debian Testing AMD64, Intel Xeon 3.0 GHz.

Using the onboard SATA controller, I don't get corruption with 2.6.18. Tell me if you need additional information.

Thanks
If there are no more problems then it looks like the bug can be closed now. Thanks.
(In reply to comment #11)

The bug still exists, as you can see in the following bug report on Debian:

"linux-2.6.18-6-amd64: 3w-xxxx v1.26.02.001 causes datacorruption with AMD64/EM64T with >2GB RAM"
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=464923

The problem is with the driver included in the kernel. More information at http://www.3ware.com/KB/article.aspx?id=15243

I'll quote the first post of the bug report, as it includes very relevant information:

"
Package: linux-2.6.18-6-amd64
Severity: critical
Tags: patch
Justification: causes serious data loss

How to reproduce:

10036:/tmp# uname -a
Linux 10036 2.6.18-6-amd64 #1 SMP Wed Jan 23 06:27:23 UTC 2008 x86_64 GNU/Linux
10036:/tmp# grep MemTotal /proc/meminfo
MemTotal: 8179992 kB
10036:/tmp# dd if=/dev/urandom bs=1048576 count=10240 | tee testfile | md5sum
00a513dc3da637c4c86557102b0e6098 -
10239+1 records in
10239+1 records out
10737297534 bytes (11 GB) copied, 1584.6 seconds, 6.8 MB/s
10036:/tmp# md5sum testfile
bfceb91a358dfc3d09e22ad74b7ebefb testfile
10036:/tmp#

How to fix:

There is a 3ware KB article on this:
http://www.3ware.com/KB/article.aspx?id=15243

This includes "3ware Storage Controller device driver for Linux v1.26.03.000-2.6.18."

From the driver:
1.26.03.000 - Use default DMA data direction to prevent data corruption when using SWIOTLB with 4GB+ on EM64T.

Installing this fixes the problem for me.

-- System Information:
Debian Release: 4.0
APT prefers stable
APT policy: (500, 'stable')
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-4-vserver-k7
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
"

My server installation is also suffering from this bug (data corruption):

Debian Release: 4.0 (stable)
Architecture: amd64 (64 bits)
Kernel: 2.6.18-5-amd64
RAM: 4GB
RAID Controller: 3w-8500 series
Processor: Intel Xeon - EM64T

As I've stated before, the problem resides in the driver included in the kernel, so it's just a matter of replacing it with a patch to the kernel.
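The reproduction above (write random data, checksum it on the way in, then checksum the file as read back) can also be sketched as a small C helper. This is an illustration only, not part of either bug report: the function names are hypothetical, an FNV-1a checksum stands in for md5sum, and a xorshift generator stands in for /dev/urandom.

```c
#include <stdint.h>
#include <stdio.h>

/* FNV-1a 64-bit running checksum (stand-in for md5sum in the report). */
static uint64_t sketch_fnv1a(uint64_t h, const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL; /* FNV-1a 64-bit prime */
    }
    return h;
}

/*
 * Hypothetical helper: writes `total` bytes of pseudo-random data to
 * `path`, reads the file back and compares the two checksums.
 * Returns 0 on match, 1 on mismatch, -1 on an I/O error.
 */
static int sketch_write_and_verify(const char *path, size_t total)
{
    unsigned char buf[4096];
    uint64_t state = 0x9e3779b97f4a7c15ULL; /* xorshift64 seed */
    uint64_t wsum = 0xcbf29ce484222325ULL;  /* FNV-1a offset basis */
    uint64_t rsum = 0xcbf29ce484222325ULL;
    size_t left, n;
    FILE *f = fopen(path, "wb");

    if (!f)
        return -1;
    for (left = total; left > 0; left -= n) {
        n = left < sizeof buf ? left : sizeof buf;
        for (size_t i = 0; i < n; i++) {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            buf[i] = (unsigned char)state;
        }
        wsum = sketch_fnv1a(wsum, buf, n); /* checksum of what we wrote */
        if (fwrite(buf, 1, n, f) != n) {
            fclose(f);
            return -1;
        }
    }
    if (fclose(f) != 0)
        return -1;

    f = fopen(path, "rb");
    if (!f)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        rsum = sketch_fnv1a(rsum, buf, n); /* checksum of what came back */
    if (ferror(f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return wsum == rsum ? 0 : 1;
}
```

A call such as `sketch_write_and_verify("testfile", total)` returns 1 when the data read back differs from what was written. As in the quoted report, where a roughly 10 GiB file was written on a machine with about 8 GB of RAM, the file probably needs to be larger than RAM for the test to mean anything; a small file may be served from the page cache without the read ever reaching the controller.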
Javier,

The final issue you are talking about is different from the original issue in this bugzilla entry:

The 3w-xxxx driver calling pci_map_sg() with sc_data_direction == DMA_BIDIRECTIONAL caused data corruption when going through swiotlb.c, on EM64T with 4GB or more of RAM. AMD64 systems using the IOMMU were never affected.

This problem doesn't exist in the 3w-xxxx driver in kernels 2.6.23 and higher. The reason is that the 'scsi data buffer accessors' patches removed most instances of scsi drivers calling pci_map_sg() and replaced them with scsi_dma_map(). This corrected the problem of the 3w-xxxx driver overriding the default sc_data_direction, which was causing data corruption on EM64T systems with 4GB+ RAM.

If you need a driver update for an older kernel to fix this issue, please go to www.3ware.com. However, no driver patch to 3w-xxxx needs to be sent to the kernel tree to fix this issue.

Please close this BUG.

-Adam Radford
Thanks, Adam - closing the bug.