Bug 9468 - r8169, swiotlb, 4GB RAM, breakage
Status: CLOSED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assigned To: Francois Romieu
Depends on:
Blocks:
Reported: 2007-11-28 13:43 UTC by Alistair Strachan
Modified: 2011-07-14 08:00 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.24-rc3
Tree: Mainline
Regression: No


Attachments
pci_unmap unbalance (475 bytes, text/plain)
2008-08-20 14:49 UTC, Francois Romieu

Description Alistair Strachan 2007-11-28 13:43:47 UTC
(Shameless repost from my mail to LKML. Seems to have gone quiet so I thought I wouldn't let this bugger get lost.)

I have recently assembled a Core 2 Duo system with 4GB RAM and I believe there
might be a bug in the r8169 driver in configurations with 4GB of RAM or more.

Initially I can use one of the two active r8169 NICs on the motherboard
alongside other devices without issue, even with this quantity of RAM. But
after some amount of data has been transferred (generally about 50MB), no more
network packets are sent or received.

The "choke" affects other devices on the system too, notably libata, which 
does not recover gracefully. In my logs, I see a stream of:

DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:04:00.0
DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:04:00.0

The device 0000:04:00.0 corresponds to one of the r8169s.
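
For anyone wanting to confirm which device the messages point at, a minimal sketch like the following tallies them per PCI address. The regular expression is an assumption based only on the two log lines quoted above, and summarize is a hypothetical helper name, not part of any kernel tooling.

```python
import re
from collections import Counter

# Assumption: the message format matches the two lines quoted in this report;
# other kernel versions may word it differently.
PATTERN = re.compile(
    r"DMA: Out of SW-IOMMU space for (\d+) bytes at device (\S+)"
)


def summarize(lines):
    """Return {device: (failure_count, largest_request_in_bytes)}."""
    counts = Counter()
    largest = {}
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a swiotlb exhaustion message
        size, device = int(match.group(1)), match.group(2)
        counts[device] += 1
        largest[device] = max(largest.get(device, 0), size)
    return {dev: (counts[dev], largest[dev]) for dev in counts}


sample = [
    "DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:04:00.0",
    "DMA: Out of SW-IOMMU space for 7222 bytes at device 0000:04:00.0",
]
print(summarize(sample))
```

Feeding it the output of dmesg (one line per list element) would show whether any device other than the NIC ever fails a mapping.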

The reason I believe r8169 is at fault is that I was doing a rebuild of my 
RAID5 across 3 SATA drives via libata's ahci driver, and transferring over the 
network. When the "choke" occurred the RAID sync stopped, libata errors were 
seen, and I simply did an "ifconfig br0 down" (br0 being the bridge containing the r8169) and 
the messages went away. Bringing the NIC up again would see some initial 
functionality then very rapidly it would go back to the same error messages.

The Intel chipset I am using does not support any kind of hardware IOMMU, so I 
am forced to use swiotlb in a 4GB RAM configuration. In an attempt to delay 
the failures, I used the swiotlb option to increase the swiotlb's page 
allocation with "swiotlb=65536" (which seems to correspond to a 256MB bounce 
buffer).
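
As a rough check of that estimate, the arithmetic can be sketched as follows. The 256MB figure follows if each unit is a 4 KiB page. If the parameter instead counts 2 KiB slabs (believed to be the swiotlb slab size in kernels of this era, but treated here as an assumption), the same value would give 128MB. Both unit sizes and the helper name pool_mib are illustrative.

```python
# Assumed unit sizes; neither is confirmed anywhere in this report.
PAGE_BYTES = 4096  # assumption behind the 256MB estimate: 4 KiB pages
SLAB_BYTES = 2048  # assumption: one swiotlb slab is 1 << 11 bytes


def pool_mib(units, unit_bytes):
    """Size in MiB of a pool made of `units` units of `unit_bytes` each."""
    return units * unit_bytes // (1024 * 1024)


print(pool_mib(65536, PAGE_BYTES))  # 256
print(pool_mib(65536, SLAB_BYTES))  # 128
```

Either way, the value is a multiple of the default pool, so it should only postpone exhaustion if something is leaking.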

Assuming both libata and r8169 use the swiotlb, and both systems are impaired 
when these messages appear, removing r8169 would appear to be key. Indeed, if 
there is no significant libata activity, the problem still occurs on the NIC 
within approximately the same amount of transfer.

The swiotlb=65536 option delays the failure for some time but it will happen eventually, 
which makes me suspicious that maybe the driver is somehow pinning an area of 
the buffer and not releasing it. (I hunted bugzilla for reports similar to 
this one, but couldn't find anything.)
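
The suspicion can be illustrated with a toy model (not the actual driver or swiotlb code). It assumes a pool of 2 KiB slabs and one hypothetical way of pinning them: each buffer is mapped with its full size but unmapped with a smaller size, so some slabs are never returned. The 7222-byte mapping size comes from the log above; the 1500-byte unmap size and the function names are made up for illustration.

```python
import math

SLAB_BYTES = 2048  # assumption: swiotlb slab size


def slabs_for(nbytes):
    """Number of whole slabs needed to hold `nbytes`."""
    return math.ceil(nbytes / SLAB_BYTES)


def cycles_before_failure(pool_slabs, map_bytes, unmap_bytes):
    """Count map/unmap cycles a toy bounce pool survives.

    Each cycle maps `map_bytes`, then releases only the slabs covered by
    `unmap_bytes`. Returns None when nothing leaks (the pool never runs out).
    """
    mapped = slabs_for(map_bytes)
    released = min(mapped, slabs_for(unmap_bytes))
    if released == mapped:
        return None  # balanced map/unmap: no leak
    free = pool_slabs
    cycles = 0
    while free >= mapped:  # a mapping needs `mapped` free slabs
        free -= mapped - released  # leaked slabs stay pinned
        cycles += 1
    return cycles


# Hypothetical pool sizes: 32768 slabs, then doubled to 65536.
print(cycles_before_failure(32768, 7222, 1500))
print(cycles_before_failure(65536, 7222, 1500))
print(cycles_before_failure(32768, 7222, 7222))  # balanced case
```

In this model, doubling the pool roughly doubles the number of buffers handled before the first failed mapping, but never removes the failure, which matches the behaviour described above. Only a balanced map/unmap pair keeps the pool from draining.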

Having tested the r8169 driver on an AMD system I did not experience the same 
problems with 4GB RAM, so this could be a bug specific to swiotlb. I would 
have added more people to CC but I have no idea who might be responsible.

mem=2000M on the cmdline works around this issue.
Comment 1 Alistair Strachan 2008-03-12 17:31:27 UTC
No change with 2.6.25-rc5. Disabling the IOMMU still works around the problem (albeit at the loss of half my RAM).
Comment 2 Alistair Strachan 2008-05-17 10:39:38 UTC
Still a problem with 2.6.26-rc2.
Comment 3 Branko Badrljica 2008-06-11 15:17:33 UTC
Same here with 2.6.25-r4 (gentoo-sources), but on a different system: AMD Phenom with 4GB RAM.

Interesting things are:

- If I take out the Phenom 9850 and plug in an X2 6000+, the IOMMU error vanishes and the system works perfectly normally.

- On the Phenom, a workaround is to use a smaller MTU. I had an MTU of 7200, which was problematic, so I lowered it. It seems that anything up to 3600 works fine for me, but I use 2048 just to be sure.
Comment 4 Alistair Strachan 2008-06-11 15:35:32 UTC
As I tried to explain in my original report, my problem is with the swiotlb support, NOT the AMD hardware IOMMU. My Intel chipset does not support any hardware IOMMU. Unless you're explicitly disabling GART IOMMU on AMD, I can't imagine your problem is the same as mine.

Of course, it would be really interesting if it were. Do you see exactly the same messages? Could I see your dmesg?

Other posters have mentioned the lower MTU, but I still have problems (packet loss, unreliability) even at 1500; it just won't crash the machine (but who knows, maybe it just takes 5-6 times longer to do so).
Comment 5 Francois Romieu 2008-08-20 14:49:14 UTC
Created attachment 17344
pci_unmap unbalance

Alistair, can you give the attached patch a try against 2.6.27-rc?

-- 
Ueimor
Comment 6 Alistair Strachan 2008-08-20 15:32:45 UTC
Thanks a LOT. This patch seems to fix the issue completely and I'm running with swiotlb once again on 2.6.27-rc3. I transferred >4GB of data over the link with 7.2k frames, which is the most I've ever been able to do with the r8169 driver + Intel + 4GB RAM, with no warnings in dmesg.

I think you've plugged this particular leak. Well done!

Tested-by: Alistair John Strachan <alistair@devzero.co.uk>
Comment 7 Adrian Bunk 2008-08-28 13:15:58 UTC
fixed by commit a866bbf6aacf95f849810079442a20be118ce905
Comment 8 4hya 2011-07-14 04:51:54 UTC
how would I use this patch for linux mint 11 x64?
Comment 9 Francois Romieu 2011-07-14 08:00:47 UTC
(In reply to comment #8)
> how would I use this patch for linux mint 11 x64 ?

Your distribution claims to be 2.6.38 based. If so the patch is already
included in it.

Please contact your distribution vendor.

-- 
Ueimor
