Bug 11811 (ath9k-DMA-SW-IOMMU) - ath9k / DMA: Out of SW-IOMMU space for 4224 bytes at device 0000:0b:00.0
Summary: ath9k / DMA: Out of SW-IOMMU space for 4224 bytes at device 0000:0b:00.0
Status: CLOSED CODE_FIX
Alias: ath9k-DMA-SW-IOMMU
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 high
Assignee: Luis Chamberlain
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-10-23 00:45 UTC by Matthias Goldhoorn
Modified: 2008-12-11 15:11 UTC
CC List: 11 users

See Also:
Kernel Version: 2.6.27
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
/var/log/messages (410.03 KB, text/plain)
2008-10-23 00:45 UTC, Matthias Goldhoorn
lspci from clean boot (1.96 KB, text/plain)
2008-10-23 00:47 UTC, Matthias Goldhoorn
Patch to disable driver on x86_64 when swiotlb is in use (1.27 KB, text/plain)
2008-11-17 18:38 UTC, Chuck Ebbert
swiotlb - disable use of overflow buffer, fixup ranges, untested (4.70 KB, patch)
2008-11-18 23:08 UTC, Maciej Żenczykowski
swiotlb - disable use of overflow buffer, etc - tested (7.36 KB, patch)
2008-11-24 15:59 UTC, Maciej Żenczykowski

Description Matthias Goldhoorn 2008-10-23 00:45:01 UTC
Latest working kernel version: 2.6.27
Earliest failing kernel version: 2.6.27
Distribution: Gentoo
Hardware Environment: MacBook Pro Core2Duo 2 GHz
Software Environment:
Problem Description: Massive DMA problems with the ath9k driver

Steps to reproduce:
Boot the kernel, load the Atheros driver, download anything. After around 30-120 seconds the kernel hangs on WLAN and disk access.
Massive kernel output:

Oct 14 15:35:03 localhost DMA: Out of SW-IOMMU space for 4224 bytes at device 0000:0b:00.0

After a while:

Oct 14 15:35:04 localhost ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 14 15:35:04 localhost ata1.00: cmd 35/00:00:b8:17:51/00:04:17:00:00/e0 tag 0 dma 524288 out
Oct 14 15:35:04 localhost res 50/00:00:77:76:1c/00:00:16:00:00/e0 Emask 0x40 (internal error)
Oct 14 15:35:04 localhost ata1.00: status: { DRDY }
Oct 14 15:35:04 localhost DMA: Out of SW-IOMMU space for 4224 bytes at device 0000:0b:00.0
Oct 14 15:35:04 localhost ata1.00: configured for UDMA/133
Oct 14 15:35:04 localhost ata1: EH complete

Only a hard reboot solves the problem. The problem occurs only when the Atheros driver is loaded; with the disk driver alone everything works without problems.
I found a similar problem at:

http://kerneltrap.org/mailarchive/linux-kernel/2008/8/4/2815804


Attached /var/log/messages...
Comment 1 Matthias Goldhoorn 2008-10-23 00:45:53 UTC
Created attachment 18413
/var/log/messages

Messages from Crash
Comment 2 Matthias Goldhoorn 2008-10-23 00:47:28 UTC
Created attachment 18414
lspci from clean boot
Comment 3 John W. Linville 2008-10-23 07:55:47 UTC
Did you try booting with any of the larger swiotlb= settings on the kernel command line, as described here?

   http://kerneltrap.org/mailarchive/linux-kernel/2008/8/3/2800864

Not sure what would cause a problem like this.  An excessively large number of in-flight DMA transactions?
Comment 4 Matthias Goldhoorn 2008-10-24 00:33:03 UTC
I tried increasing swiotlb to 64k and 128k, but the problem occurs in the same way.
The only difference is that it takes longer (while downloading and compiling) to appear, and the filesystem errors are larger.
With 128k it takes twice as long as with 64k.
Comment 5 Matthias Goldhoorn 2008-11-02 08:32:39 UTC
Some news here:
if I boot with mem=3G, my system works without problems.
I upgraded the RAM to 4GB; all other OSes work without problems. Memory check tools say the RAM is OK.
So the problem depends on the available RAM...

Any new ideas?
Comment 6 Chuck Ebbert 2008-11-13 09:47:34 UTC
Reported in Fedora 9, kernel 2.6.27.5:

https://bugzilla.redhat.com/show_bug.cgi?id=471329

Causes filesystem corruption when it happens. Shouldn't swiotlb default to never returning a fallback buffer and always panic when it runs out of space so that filesystem corruption is prevented?
Comment 7 Luis Chamberlain 2008-11-14 18:13:34 UTC
I'm looking into this.

Two distribution bugs are opened also for this:

https://bugzilla.redhat.com/show_bug.cgi?id=471329
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/267089

Let's consolidate them into this one.

So far this seems to be a MacBook Pro 3.1 issue, on an Intel 965 memory controller that relies on a software IOMMU, with > 3 GB of RAM on an x86_64 kernel.

I was not able to reproduce this on an x86_64 box with an Intel quad core and > 7 GB of memory; however, the memory controller was different.

The only other interesting thing is it works with MadWifi.

I'm stumped.

General question: anyone know if lib/swiotlb.c can handle > 3 GB just fine?

Also please try these patches although I doubt they help:

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-14/

The second one is the only one that should really have an effect, if anything.
Comment 8 Luis Chamberlain 2008-11-14 19:23:57 UTC
OK so I tried booting with swiotlb=force on an x86_64 box with > 7GB RAM, to force it to use the same software IOMMU for DMA, and ping flooded my AP for a while and I see no issue. This was with kernel:

2.6.27.5-101.fc10.x86_64

So... I'm not sure what else I can try without having the hardware here; it seems we may need to buy it or something. We will also try to count the number of DMA pages we keep around, to see if we are indeed hogging them.
Comment 9 Luis Chamberlain 2008-11-14 19:24:36 UTC
Oh and on my ping flood I got 0% packet loss too!
Comment 10 Luis Chamberlain 2008-11-15 08:52:37 UTC
Under what conditions exactly do you see this issue BTW? Do you just boot up and get it immediately? If it is after a while, how long is "after a while", and how much TX/RX traffic?

Can you please give a reproducible scenario under which we can try to reproduce this?
Comment 11 Toon Verwaest 2008-11-15 09:20:47 UTC
I got it yesterday after installing Gentoo on my machine, since my previous hd was broken for other reasons and I had to reinstall anyway. Anyhow, I just downloaded the newest kernel from kernel.org (2.6.27.6) and enabled ath9k. I didn't do anything special; I rebooted into the system without even logging into my access point. I just enabled ath0. I didn't look at my machine for 10 minutes and the whole partition which I had just installed was broken. Half of the data was put into lost+found etc.

On previous tries on Ubuntu it took less than 5 minutes and it always partly broke my filesystem.

I'd rather not try to use it too much because of that reason though ... I need my system up and running for my work; as it's my only system.
Comment 12 Luis Chamberlain 2008-11-15 09:28:19 UTC
Wow, ok, yeah I understand. So we're working hard trying to reproduce it.. but no luck so far so trying to gather more information to see how we can reproduce it.

Also trying to see under what scenario it will happen: do you TX a lot, or only RX? Or do you just do a slight amount of browsing?

Can someone see if they can trigger this by just booting up and then pinging only a few packets, like 10 for example:

ping -c 10 192.168.1.1

where the 192.168.1.1 is supposed to be your gateway of course, then try increasing it, step by step.

If you can provide more details it'd be great.

I understand the issue is affecting partitions but in order to test perhaps it makes sense to try to mount read only then while testing?
Comment 13 Luis Chamberlain 2008-11-15 21:36:32 UTC
OK so I left a ping flood overnight on a box with ath9k, swiotlb and > 7GB RAM. It didn't panic nor did I see any issues. Eventually it just detected no probe response from the AP but that is separate, the card I tested with was AR9280 though but for 11G purpose the DMA'ing should follow the same path.

Small note: As per the x86_64 kernel boot parameters documentation all Intel boxen use the SW IO TLB when using DMA over 3 GB. Not sure why the 3GB limit though, but it's stated there.

I left a regular ping overnight now.

This seems to be more and more platform specific than I thought. I guess Linville has been right all along: maybe throw the SATA driver into the mix and you may get the issue.
Comment 14 Matthias Goldhoorn 2008-11-16 03:43:31 UTC
So I just tested when the error occurs.
I booted without the mem option and at first did not associate with my access point; after 30 min no error had occurred. So I rebooted (again without the mem option) and only associated with my access point and got an IP address via DHCP.
I looked again 20 min later and saw a lot of SW-IOMMU errors...
I'm not sure whether the error appeared after 5-15 min; after 20 min it was definitely there...

When I first got this error I wanted to download the NVIDIA drivers, which is a 20MB package. The same error occurred when I tried to copy some data via scp.
Comment 15 Luis Chamberlain 2008-11-16 10:31:02 UTC
Can you try booting up and not touching the hard drive at all, maybe making it read only and try ping flooding the AP with ath9k?
Comment 16 Matthias Goldhoorn 2008-11-16 12:57:44 UTC
I just booted from an external hard disk to which I had copied my system via rsync just before.
After bootup I tried to download something again; after 15 MB I got the SW-IOMMU error again. A short time later my USB port reset and my USB HD went missing. I got some access errors (sorry, no log was written).
Because of this my system stopped working and I had to reboot with my mem option.
Comment 17 Chuck Ebbert 2008-11-17 14:35:25 UTC
Can we get a stopgap patch that keeps the driver from loading if there is memory present above 4G? Fedora 10 will probably ship with the driver disabled completely unless we can find some temporary fix...
Comment 18 John W. Linville 2008-11-17 16:28:11 UTC
I'm not sure how such a patch would look.  Do you have an example (or a patch)?
Comment 19 Luis Chamberlain 2008-11-17 17:24:48 UTC
I'm sitting with Maciej Żenczykowski, an owner of a MacBook Pro 3.1 and it seems he's willing to let me cripple^w fix the issue here.

I'll provide updates as we go.
Comment 20 Chuck Ebbert 2008-11-17 18:38:41 UTC
Created attachment 18902
Patch to disable driver on x86_64 when swiotlb is in use

This builds (it hasn't linked the modules yet) but I can't test it.
Comment 21 Chuck Ebbert 2008-11-17 19:58:36 UTC
Can someone confirm the above patch keeps the driver from activating when swiotlb is in use on x86_64?
Comment 22 Toon Verwaest 2008-11-18 03:05:55 UTC
I can confirm that after applying the patch, while the driver still gets loaded (and is listed in lsmod), it seems to be inactive. iwconfig doesn't show the related device anymore and my system (at least for now; I booted 5 minutes ago) seems stable.

I just removed the linux-xx prefix since I'm working on a 2.6.27.6.
Comment 23 Chuck Ebbert 2008-11-18 11:53:07 UTC
And does the driver work if booted with mem=3G?
Comment 24 Maciej Żenczykowski 2008-11-18 12:23:43 UTC
We've located one of the problems (pci_unmap_single is being called with a smaller length than the original pci_map_single); however, after applying the one-line patch for this problem, we're now seeing kernel panics.  It's also possible we haven't caught all pci_map_single/pci_unmap_single pairs.

http://lxr.linux.no/linux+v2.6.27.6/drivers/net/wireless/ath9k/recv.c#L1018
should have "skb_end_pointer(skb) - skb->head" instead of "sc->sc_rxbufsize"

(see ath_skb_unmap_single at bottom of file)
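
To illustrate why a shorter length on unmap starves the pool, here is a toy Python model (not kernel code). It assumes the bounce pool is carved into fixed 2 KB slots and that unmap gives back only the slots implied by the length it is handed. `BouncePool`, `slots_for`, `cycles_until_exhausted`, the pool sizes and the 2048-byte unmap length are hypothetical, chosen only for illustration; 4224 is the mapping size from the error message.

```python
SLOT_SIZE = 2048  # assumed slot granularity of the bounce pool, in bytes


def slots_for(length):
    """Whole slots needed to cover `length` bytes (ceiling division)."""
    return -(-length // SLOT_SIZE)


class BouncePool:
    """Toy fixed-size pool of bounce-buffer slots; illustration only."""

    def __init__(self, total_slots):
        self.free = total_slots

    def map_single(self, length):
        need = slots_for(length)
        if need > self.free:
            raise MemoryError("Out of SW-IOMMU space for %d bytes" % length)
        self.free -= need

    def unmap_single(self, length):
        # Only the slots implied by the length passed in are given back.
        self.free += slots_for(length)


def cycles_until_exhausted(total_slots, map_len, unmap_len, max_cycles=100_000):
    """Return the map/unmap cycle on which mapping fails, or None."""
    pool = BouncePool(total_slots)
    for cycle in range(1, max_cycles + 1):
        try:
            pool.map_single(map_len)
        except MemoryError:
            return cycle
        pool.unmap_single(unmap_len)
    return None


# Matching lengths: the pool never drains within max_cycles.
balanced = cycles_until_exhausted(32768, 4224, 4224)
# Hypothetical mismatch: map 4224 bytes (3 slots), unmap 2048 bytes (1 slot).
leaky = cycles_until_exhausted(32768, 4224, 2048)
leaky_double_pool = cycles_until_exhausted(65536, 4224, 2048)
```

In this model every mismatched cycle loses two slots, and doubling the pool only doubles the number of cycles before the failure; it does not prevent it.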

---

The last panic I've seen is:

ret_from_intr -> do_IRQ irq_exit do_softirq call_softirq __do_softirq tasklet_action ath9k_tasklet ath_rx_tasklet ath9k_hw_gettsf64 ath__rx_indicate __ieee80211_rx __ieee80211_rx_handle_packet   kfree_skb  __kfree_skb skb_release_all put_page+0x14 gpf(0000)

This one was triggered by booting a single-line patched 2.6.27.5-37.fc9.x86_64 kernel into init=/bin/bash, modprobe'ing ath9k, and ip link set wlan0 up.
No essid was set, nothing else was performed, it was just left to sit overnight.

This panic is moderately interesting since all the previous flood ping triggered ones implied swiotlb somewhere in the call stack - this one doesn't.
Comment 25 Maciej Żenczykowski 2008-11-18 12:24:31 UTC
Booting with mem=3G does indeed seem to make a flood ping work stably (sent some 50K packets without problems)
Comment 26 Maciej Żenczykowski 2008-11-18 12:25:35 UTC
(the above comment is not relevant to the patch to disable ath9k on swiotlb - I haven't tested that yet, would prefer to fix it)
Comment 27 Luis Chamberlain 2008-11-18 19:45:26 UTC
Thanks to Maciej for his patience yesterday in allowing me to work with him on abusing his MacBook Pro 3.1. So here are two patches based on our observations yesterday:

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-19/27-iommu/

The first patch addresses a different length on pci_unmap_single(). The second patch ensures we only use 4KB on the buffers.

As the commit message says, a panic is expected now though, that which Maciej mentions. We need to iron this one issue out now. If anyone can get a full stack trace that would be very useful. Try compiling your kernel with debugging symbols.
Comment 28 Maciej Żenczykowski 2008-11-18 21:46:26 UTC
Here's the explanation for disk corruption:

Comment at:
  http://lxr.linux.no/linux+v2.6.27.6/lib/swiotlb.c#L88
is
  "When the IOMMU overflows we return a fallback buffer."

Hence, if we fill up the bounce buffers (as was happening here) then we end up always returning that fallback buffer.  (Here: http://lxr.linux.no/linux+v2.6.27.6/lib/swiotlb.c#L579 ).  If we then fail to test the return value via swiotlb_dma_mapping_error then we'll always end up using this buffer for DMA.  Obviously this results in collisions and we end up sending network rx/tx frames to disk, etc.

On a 4G machine, with a 3G ram + 1G hole + 1G ram memory layout, the probability of needing bounce buffers is 25% (the top 1G needs it for DMA32).  So 1 in 4 reads/writes to disk will end up using the above emergency buffer, if a few of these happen at once - you end up reading/writing/dma'ing garbage.
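
A toy Python sketch (not kernel code) of the aliasing described above: once the regular bounce buffers are gone, every caller that does not check for the error is handed the same fallback buffer. `ToyPool` and the disk/network labels are hypothetical; the last lines just restate the 1-in-4 estimate for the 3G + hole + 1G layout.

```python
class ToyPool:
    """Toy bounce pool with one shared overflow buffer; illustration only."""

    def __init__(self, n_buffers, buf_size=16):
        self.free = [bytearray(buf_size) for _ in range(n_buffers)]
        self.overflow = bytearray(buf_size)  # single buffer, no ownership tracking

    def map_single(self):
        if self.free:
            return self.free.pop()
        # Pool exhausted: every caller from here on gets the same buffer.
        return self.overflow


pool = ToyPool(n_buffers=1)
pool.map_single()             # uses up the only regular buffer

disk_buf = pool.map_single()  # hypothetical disk write
net_buf = pool.map_single()   # hypothetical network receive

disk_buf[:4] = b"FILE"
net_buf[:4] = b"PKT!"         # lands in the same memory as the disk data

# Share of RAM above the 32-bit boundary in the layout described above.
ram_gib = 4
above_4g_gib = 1
share_needing_bounce = above_4g_gib / ram_gib
```

After the last write, `disk_buf` holds the network payload, which is the kind of collision that would put packet data on disk.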

Furthermore, it might be possible to trigger this bug via kernel boot option "iommu=soft,force" even on machines with <=3G of RAM.  The soft part forces the use of SW-IOMMU (instead of a potential hw solution), and the force part makes it used for all DMA transfers, not only those that actually need it.  This will hurt performance, but may make it easier to trigger this bug.

Furthermore, the logic at:
http://lxr.linux.no/linux+v2.6.27.6/lib/swiotlb.c#L570
seems flawed, since we're only testing the beginning of the buffer, instead of the end.  Imagine a 2 byte DMA32 transfer at 4GB-1.  It passes the check, but only the first byte is DMA32 capable.  Unsure if this can ever actually be triggered with the alignment restrictions we have all over the place, but seems wrong.  My guess is
  !address_needs_mapping(hwdev, dev_addr)
should be
  !address_needs_mapping(hwdev, dev_addr + size - 1)
or
  !address_needs_mapping(hwdev, virt_to_bus(ptr + size - 1))
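
The 4GB-1 example can be checked with a few lines of Python. This is only a sketch: `needs_mapping` is a hypothetical stand-in for address_needs_mapping() that assumes a plain 32-bit mask test, and the two `check_*` helpers are illustrative names for the current and the proposed condition.

```python
DMA32_MASK = 0xFFFFFFFF  # highest address a 32-bit DMA device can reach


def needs_mapping(addr, mask=DMA32_MASK):
    """Stand-in for address_needs_mapping(): any address bit above the mask."""
    return (addr & ~mask) != 0


def check_start_only(dev_addr, size):
    """Condition as quoted: only the first byte is tested."""
    return not needs_mapping(dev_addr)


def check_last_byte(dev_addr, size):
    """Proposed condition: test the last byte of the transfer."""
    return not needs_mapping(dev_addr + size - 1)


addr, size = (1 << 32) - 1, 2   # 2-byte transfer starting at 4GB-1
start_only_passes = check_start_only(addr, size)
last_byte_passes = check_last_byte(addr, size)
```

Here the start-only check accepts the transfer, while the last-byte check rejects it because the second byte sits at exactly 4GB.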
Comment 29 Maciej Żenczykowski 2008-11-18 21:48:23 UTC
(side note: my guess is the kernel should panic if it runs out of sw-iommu, using a fallback buffer seems inherently busted...)
Comment 30 Maciej Żenczykowski 2008-11-18 21:50:20 UTC
(not to mention, that we don't even check if the size of the dma we're trying to perform will or will not fit in the fallback buffer, so we might end up overflowing the fallback buffer as well... it's only 32KB... http://lxr.linux.no/linux+v2.6.27.6/lib/swiotlb.c#L90)
Comment 31 Maciej Żenczykowski 2008-11-18 22:08:11 UTC
map_single will actually panic if we attempt to map more than the overflow buffer size, so that's not a problem, and the scatter gather interface expects the caller to deal with errors correctly.

However, I still fail to see locking for the overflow fallback buffer.
Comment 32 Maciej Żenczykowski 2008-11-18 22:18:02 UTC
Ok, I fail to see the benefit of even having a fallback overflow buffer.

If we've exhausted 64MB of the primary bounce buffer, we'll soon exhaust the last 32KB of the overflow buffer as well.  If we haven't... then we don't need it.

Seems like needless complexity.

The only benefit seems to be that it's one extra DMA buffer that can be used via map_single, that cannot be used via the scatter-gather interfaces (those never return the fallback buffer, instead returning errors).

We either need to drop the fallback buffer, and just panic.  Or lock it on map_single, and unlock on unmap_single, and panic if map_single hits it locked.

The first case (drop it) can be done trivially by setting its size to 0 bytes at http://lxr.linux.no/linux+v2.6.27.6/lib/swiotlb.c#L90.
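
For comparison, the second option (lock on map_single, unlock on unmap_single, panic if it is hit while locked) could look roughly like this toy Python model. It is not kernel code: `LockedOverflowBuffer` is a hypothetical name, and a RuntimeError stands in for the panic.

```python
OVERFLOW_SIZE = 32 * 1024  # the 32KB overflow buffer mentioned above


class LockedOverflowBuffer:
    """Toy overflow buffer that can be handed out to one mapping at a time."""

    def __init__(self, size=OVERFLOW_SIZE):
        self.buf = bytearray(size)
        self.in_use = False

    def map_single(self, length):
        # Refuse loudly instead of silently sharing the buffer.
        if length > len(self.buf) or self.in_use:
            raise RuntimeError("panic: overflow buffer unavailable")
        self.in_use = True
        return self.buf

    def unmap_single(self):
        self.in_use = False


def second_map_panics():
    """Map twice without unmapping; report whether the second map 'panics'."""
    ovf = LockedOverflowBuffer()
    ovf.map_single(4224)
    try:
        ovf.map_single(4224)
    except RuntimeError:
        return True
    return False
```

The point of the sketch is only that a second mapping fails immediately rather than aliasing the first one; dropping the buffer altogether gives the same guarantee with less code.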
Comment 33 Maciej Żenczykowski 2008-11-18 23:08:42 UTC
Created attachment 18925
swiotlb - disable use of overflow buffer, fixup ranges, untested

First attempt at a swiotlb overflow buffer disabling patch.
Comment 34 Maciej Żenczykowski 2008-11-19 03:17:21 UTC
The above kernel command line should be "iommu=soft swiotlb=force"
Comment 35 Luis Chamberlain 2008-11-19 12:34:14 UTC
Yeah I was reviewing the swiotlb code yesterday and I noticed the emergency stuff and it seemed fishy to me too. I'll review your patch.

Also note I had tested before running with "iommu=soft swiotlb=force" also with mem=4G and I never ran into any weird corruption issue. I ran TX/RX tests (GBs of data) while running a make allyesconfig of the kernel in a loop in the background.

So *something* forces the MacBook Pro 3.1 to use more of the SWIOTLB bounce buffers than regular machines. Not sure what that could be. The patches we posted, which fix the pci_unmap_single() length, should handle the starvation, and they also force us to correctly use only 4KB for our buffers rather than 8KB. Your patch should handle the DMA race on the emergency buffer, and if it's true that MacBook Pro 3.1s tend to use the emergency buffer more then it should fix possible DMA issues for it. I believe the starvation was caused by the different sizes in the map/unmap though, so I don't think the emergency buffer should be used now unless more devices are also using the bounce buffers a lot.

We now need to test all 3 patches together on MacBook Pro 3.1

One idea I had was to see if we can force the usage on the bounce buffers on my systems which are not affected by changing the decision of when they will be used. I haven't yet found the exact code path which has this logic though but once I do I want to test it and force usage of it more to test against it.
Comment 36 Maciej Żenczykowski 2008-11-19 13:44:07 UTC
a) after thinking more about it, I don't think we need to add size - 1 to all the need_mapping checks:  because of the alignment of buffers that alloc_pages (skb/kmalloc etc) returns (and due to it never returning 0 as a valid alloc start location), we can't _ever_ end up with a situation where the start and end locations of a buffer have a different number of address bits.  Thus checking just the starting location should be fine, although it's {less|not| far from} obvious that it's correct.  A comment would help.

b) without the pci_map/unmap_single patch to ath9k, the driver should _always_ leak memory like crazy if it's using the sw-iommu...  with iommu=soft swiotlb=force, it should always use it, thus it should always run out quickly enough.  Since it's failing to fail... we must still be missing some interaction somewhere.
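
The alignment argument in (a) can be sanity-checked with a short Python sketch. It assumes blocks whose size is a power-of-two multiple of a 4KB page and whose start is aligned to their own size; `straddles` and `aligned_blocks_near_boundary` are hypothetical helper names, and this is an illustration, not a proof about every kernel allocator.

```python
BOUNDARY = 1 << 32  # the 32-bit DMA limit


def straddles(start, size, boundary=BOUNDARY):
    """True if [start, start + size) has bytes on both sides of `boundary`."""
    return start // boundary != (start + size - 1) // boundary


def aligned_blocks_near_boundary(order, page_size=4096, span=4):
    """Naturally aligned 2**order-page blocks just below and above the boundary."""
    block = page_size << order
    first = BOUNDARY - span * block
    return [(first + i * block, block) for i in range(2 * span)]


# No naturally aligned block (orders 0..10) crosses the 4GB line.
aligned_ok = not any(
    straddles(start, size)
    for order in range(0, 11)
    for start, size in aligned_blocks_near_boundary(order)
)
# A buffer without natural alignment (hypothetical) can cross it.
unaligned_straddles = straddles(BOUNDARY - 1, 2)
```

Any buffer carved out of one such aligned block stays on the same side of 4GB as the block's first byte, which is why testing only the start address is enough under this assumption.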
Comment 37 Luis Chamberlain 2008-11-19 14:09:11 UTC
Well keep in mind that the swiotlb bounce buffers should be used only when you want to pci_map_single() virtual memory > 32 bit boundary. One theory here is that the MacBook Pro 3.1 gets virtual memory > 32 bits when dev_alloc_skb() is used more regularly than other platforms. Not sure how true this could be... but it's a theory.

But I also don't see the logic yet where the swiotlb will prefer bounce buffers over a direct map (when it's < 32 bits or < 3GB).
Comment 38 Maciej Żenczykowski 2008-11-19 16:20:08 UTC
Yes, but theoretically (and if I'm reading the code correctly...) swiotlb=force should force the use of the sw-iommu for _ALL_ dma - even the dma below the 32-bit barrier.

The logic you're looking for is right here:
http://lxr.linux.no/linux+v2.6.27.6/lib/swiotlb.c#L570
Comment 39 Luis Chamberlain 2008-11-19 22:26:42 UTC
Thanks Maciej, yeah it should force to use the bounce buffers then.. hmm, I am not sure why I don't see the issue you are seeing when I do enable the swiotlb=force. It puzzles me even more.

Anyway I've updated the patches we had a bit, and added another one to tell the hardware to DMA to us only what we expect it to.

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-20/DMA-01/

Please give those a shot, although I haven't ported them to 2.6.27 though... they apply to wireless-testing.
Comment 40 Alistair Strachan 2008-11-20 16:41:47 UTC
FWIW I was able to reproduce this problem with iommu=soft swiotlb=force mem=512M on a v2 Macbook, and I was also able to confirm that these patches do appear to fix the problem. When the problem occurs, it occurs almost immediately, but with the patches I've transferred several hundred megabytes over the interface now without any problems.

HTH.
Comment 41 Luis Chamberlain 2008-11-20 17:16:52 UTC
Thanks for testing, this is good news. Now I would just like to hear this news from a MacBook Pro 3.1 user as well. I've updated the patches with some more beacon.c changes. It should not affect the testing if you are not using AP or IBSS. But here are the latest patches anyway:

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-21/DMA-01/
Comment 42 Luis Chamberlain 2008-11-20 20:12:49 UTC
Just FYI I ran into one oops after md5sum'ing my /dev/sda while doing an iperf on my box with iommu=soft swiotlb=force mem=300M. A temporary fix is here:

http://ruslug.rutgers.edu/~mcgrof/tmp/oops-fix.patch

This happens because we run out of memory and we were not checking if skb was allocated or not. But this is obviously a separate issue. We'll work on a nicer patch upstream.

With this patch at least your box won't crash. The device seems to become unusable though. Can't even rmmod ath9k. This would only happen if you run out of memory. We'll look into that too.
Comment 43 Toon Verwaest 2008-11-21 01:17:14 UTC
If somebody tells me which kernel version to use and which patches to apply I would be happy to give it a try. I am a bit confused about versions though and 2.6.27.7 seems to be unhappy about the patches :)

Thanks for actively looking for a solution!
Comment 44 Alistair Strachan 2008-11-21 02:49:06 UTC
This patch should work on 2.6.27, but I hacked it together in 5 minutes and it's not even compile tested. No sign off from me :-)

http://devzero.co.uk/~alistair/ath9k-fix-io-mmu-bounce-buffer-and-expected-rx-buffer-size-2.6.27.diff

I think something like this should make it into -stable, since ath9k is potentially very badly broken with >3.5GB pre-IOMMU Intel boxen.
Comment 45 Christoph Thiel 2008-11-21 03:15:59 UTC
I'm also seeing this on an AMD box: AMD 780G/SB700 (ASUS M3A78-EM) with AMD Phenom(tm) 9350e, 4G of RAM and Atheros AR5416 802.11abgn Wireless PCI Adapter (D-Link) running 2.6.27.5 on openSUSE 11.1 Beta5.

https://bugzilla.novell.com/show_bug.cgi?id=444506

Building a kernel with the patch from comment #44 now.
Comment 46 Maciej Żenczykowski 2008-11-21 04:39:35 UTC
Please be very careful while testing this on a machine you care about (better yet, don't...).  The swiotlb code is pretty buggy and if the fix for ath9k isn't quite right, it's very easy to redirect your wireless connection directly to your hard drive (not advised for data integrity... speaking from experience :-( )

I've got a cleaned up and slightly revised patch of the swiotlb code, hopefully will post it tomorrow after testing (it'll panic instead of wiping your hard disk), continuing to review the ath9k driver as well, should have some time tomorrow to do some further testing.
Comment 47 Alistair Strachan 2008-11-21 07:48:17 UTC
I agree, I experienced data loss while testing however ext3's fsck seemed to repair it just fine.
Comment 48 Christoph Thiel 2008-11-21 08:28:33 UTC
The patch from comment #44 works great on openSUSE 11.1 Beta5, running 2.6.27.6 + the patch now. I haven't been able to reproduce the problem for a couple of hours now, which usually would have taken me only a few minutes.
Comment 49 Luis Chamberlain 2008-11-21 18:27:13 UTC
Apologies for not clarifying what kernel the patches were for. They were for wireless-testing. For 2.6.27, can someone please try out these 3 patches:

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-22/27-IOMMU-01/

The first and second one should be the same as Alistair Strachan's port of my patches in comment #44, except that they remove the comments to try to keep the patch slim for 27. The third patch is new and it's a port of how I resolved the oops I mentioned that can happen if you run out of memory. I haven't tested that patch, so I would appreciate it if someone can test it. You can force your box to run out of memory by a lot of tricks, but one good test is to boot with:

mem=100M iommu=soft swiotlb=force

I was able to reproduce the ENOMEM oops then quickly running a ping flood to my AP and at the same time running md5sum /dev/sda.

For those still looking to see if the latest patches fixes your issues but too lazy to compile your entire kernel and apply these patches you can try today's release of compat-wireless which now has the 2 new patches as John pulled these changes into wireless-testing:

http://www.orbit-lab.org/kernel/compat-wireless-2.6/2008/11/compat-wireless-2008-11-22.tar.bz2

I posted today a rework of my -ENOMEM oops fix, but you can download and apply that by wgeting from here:

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-22/buf_link.patch

The patch above is for wireless-testing and should also apply to the above compat-wireless tarball. Apply with patch -p1 < buf_link.patch.

I would like to hear some solid reports by MacBook Pro 3.1 users.
Comment 50 Corey O'Connor 2008-11-22 12:15:20 UTC
I have a MacBook Pro 3.1 that did exhibit this issue within a few minutes of use, unless ath9k was disabled and the option mem=3G was used.

I applied the 3 patches in comment #49 to Ubuntu's branch of 2.6.27-7; Compiled; Installed; Removed the mem=3G kernel option and re-enabled ath9k.

The system has been stable for several hours. I've recompiled the kernel twice while ping-flooding my AP. In addition to general web-surfing. No IO errors have been reported to /var/log/messages.

I have not done the test to reproduce the ENOMEM oops described in comment #49.

While the system is stable, the system monitor reports 85% of the memory is in use by cache. I'm not sure if this is unusual or even related.
Comment 51 Corey O'Connor 2008-11-22 12:16:38 UTC
I forgot to mention: The system is using an x86_64 install.
Comment 52 Luis Chamberlain 2008-11-22 18:42:30 UTC
Thanks for the report and testing, Corey. BTW for those concerned with testing the patches, you can run your kernel with this appended to your grub boot options:

swiotlb=panic

This will panic in case the software IO MMU runs out of bounce buffers. The leading theory so far is the partition corruption is due to the use of the emergency buffer once the bounce buffers run out. This would prevent you from using the emergency buffer.


More reports are welcomed. Looking good so far.
Comment 53 Maciej Żenczykowski 2008-11-23 20:32:54 UTC
Per http://lxr.linux.no/linux+v2.6.27.6/lib/swiotlb.c#L126 I don't think swiotlb=panic does anything, the syntax very much seems to be swiotlb=[number][,][force].
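
A small Python model of the syntax quoted above, to show why swiotlb=panic would be silently ignored. It is written from the stated swiotlb=[number][,][force] form, not copied from lib/swiotlb.c; `parse_swiotlb` is a hypothetical name.

```python
def parse_swiotlb(arg):
    """Parse 'swiotlb=' values of the form [number][,][force].

    Returns (number or None, force_flag). Anything that is not a leading
    number or the literal word 'force' has no effect.
    """
    i = 0
    while i < len(arg) and arg[i] in "0123456789":
        i += 1
    number = int(arg[:i]) if i else None
    rest = arg[i:]
    if rest.startswith(","):
        rest = rest[1:]
    return number, rest == "force"


panic_result = parse_swiotlb("panic")
force_result = parse_swiotlb("force")
sized_force_result = parse_swiotlb("65536,force")
```

Under this reading, "panic" sets neither a size nor the force flag, so booting with it behaves like not passing the option at all.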
Comment 54 Christoph Thiel 2008-11-24 01:39:05 UTC
(In reply to comment #49)
> Apologies for not clarifying what kernel the patches were for. They were for
> wireless-testing. For 2.6.27, can someone please try out these 3 patches:
> 
>
> http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-22/27-IOMMU-01/

I have been using those patches for ~2 days now on 2.6.27.7, upcoming openSUSE 11.1 RC1 kernel. No problem so far. This however was not on a MacBook, but my AMD 780G/SB700 system.


> The first and second one should be the same as Alistair Strachan's port of my
> patches on comment #44 but it removes the comments to try to keep the patch
> slim for 27. The third patch is new and its a port of how I resolved the oops
> I mentioned that can happen if you run out of memory. I haven't tested that
> patch so would appreciate if someone can test it. You can force your box to
> run out of memory by a lot of tricks but one good test is to boot with:
> 
> mem=100M iommu=soft swiotlb=force

I did a quick test on this one just now, but booting with mem=100M doesn't succeed on my box :) Going up to mem=512M I was unable to force it into the ENOMEM oops.
Comment 55 Toon Verwaest 2008-11-24 07:16:39 UTC
Just to keep you up to date; I am trying out the patches of 

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-22/27-IOMMU-01/

on my 2.6.27.7 kernel from post #49 on my macbook pro 3.1. Dmesg doesn't show any out of the ordinary errors on normal boot-up. Ping-flooding my access point gives this:

71238 packets transmitted, 64704 received, 9% packet loss, time 452905ms

Then I tried booting up with mem=100M; but just like in comment #54 I wasn't able to boot as I just got a kernel panic stating that there was too little memory. I however didn't try to find the fixpoint where it was little enough but not too little. 

I'll report back if something unusual happens; but until now it all seems pretty fine. 
Comment 56 Maciej Żenczykowski 2008-11-24 14:48:03 UTC
I'm running 2.6.27.7-50.mz6.fc9.x86_64 (MacBook Pro 3,1 with 4GB) at the moment (this is a self-patched fedora kernel from koji and includes a panic on swiotlb overflow patch and the above 3 ath9k patches) and wireless seems stable-ish.

Of note:

* sometimes (very rarely) the kernel fails to boot getting hung somewhere (apparently) in the PCI init sequence.  Last console message is:
  pci 0000:0c:00.0: PME# disabled
a few lines earlier is a pci-express related message, and I believe there were some changes relating to pci-express in the -50 kernel - perhaps that's the cause.  I can't reliably reproduce it and I believe I haven't seen it on a stock -50.fc9 kernel - so maybe I'm to blame (I've also only seen this on mz3-5 so maybe other patches were to blame, it's so _rare_ it's hard to be certain, and it also feels like it may depend on machine state from before the reboot)?

* wpa_supplicant and dhclient get into some disagreement with each other, as such getting dhclient to actually get an IP seems to require running wpa_cli reassociate soon after starting dhclient - probably because dhclient needlessly tries to up/down the interface - not quite sure what is happening here, but I think the ath9k driver is failing to send some message about state change that ath_pci did ;-)

* ping -s 42912 seems to be the highest value that works over wireless - no idea why.  Destination host is verified to accept and reply to -s 65507 pings from another machine (could be a 'firewall-somewhere-in-between' issue, but I doubt it, OTOH this failure mode makes absolutely no sense to me - IP fragmentation happens at a higher layer...)  ping -s 65507 works fine over built-in gigabit ethernet (sky2)

* running 802.11a WPAv1 TKIP/TKIP/802.1x successfully (although required the dhclient dinging around mentioned above)

  $ sudo ping -s 42912 -f -c 10000 vivio
  PING vivio 42912(42940) bytes of data.
  .........................       
  --- vivio ping statistics ---
  10000 packets transmitted, 9975 received, 0% packet loss, time 230773ms
  rtt min/avg/max/mdev = 26.577/122.987/184.624/7.326 ms, pipe 10, ipg/ewma 23.079/126.016 ms
Comment 57 Luis Chamberlain 2008-11-24 15:23:34 UTC
Thanks for the report Maciej, it seems this issue is now fixed so I'm going to close it. Patches have been provided for 2.6.27 and for wireless-testing. John already has the two DMA patches in his pending-fixes branch (for 2.6.28).

For any further issues please do report it as a new bug or provide patches (a lot more helpful ;) and thanks for all the help with testing.
Comment 58 Maciej Żenczykowski 2008-11-24 15:59:58 UTC
Created attachment 19009
swiotlb - disable use of overflow buffer, etc - tested

We should probably also get a fix for the swiotlb corruption into the kernel as well.  Any idea how (and to whom) to push a patch somewhat like the attached one?
Comment 59 Luis Chamberlain 2008-11-24 16:11:32 UTC
Agreed, please send them to:

To: linux-kernel
Cc: David Miller, Alan Cox, John Linville

Please CC me as well as I'd like to follow this issue.

Also please see if you can split up your patch into 3 separate patches which do what you mention in your commit log on a separate patch. This helps readability and review from others so it'll be easier to merge. Your patches should also apply against Linus' tree, not 2.6.27. If accepted then they can trickle down through 28 and 27. You can also add a Cc: stable on the commit log message and then stable maintainers (27) will then get an e-mail about the patch and that it should be merged (to 27 in this case).

And now for a shameless plug, for a git review please check out:

http://wireless.kernel.org/en/developers/Documentation/git-guide

In your case replace wireless-testing git tree with:

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
Comment 60 Luis Chamberlain 2008-11-24 16:15:51 UTC
Sorry, send it to Linus too :)
Comment 61 Luis Chamberlain 2008-12-05 14:57:47 UTC
2.6.27.8 was released with 2/3 of these patches:

Patch:          ftp://ftp.kernel.org/pub/linux/kernel/v2.6/patch-2.6.27.8.bz2
Full source:    ftp://ftp.kernel.org/pub/linux/kernel/v2.6/linux-2.6.27.8.tar.bz2
Comment 62 Mattzog Bellucci 2008-12-10 15:53:51 UTC
Hello, I am running a MacBook 3,1 with Ubuntu 8.10 (kernel is 2.6.27.9) and encountered this problem.  
In applying the fix, I am stymied about what to patch.  I downloaded the patch in post #61 thinking that, since I am beyond 2.6.27.8, this would be the patch I need.  I am familiar with the patch command, but am for the most part pretty noob at linux.  I can troubleshoot my installation, but kernel hacking is beyond me.  

Can anyone provide some guidance?  Sorry if this is the wrong place to ask, but the solution to my wireless woes is tantalizingly close, and I can't figure out this last bit.

Thanks for any help you can give.
Comment 63 Maciej Żenczykowski 2008-12-10 21:04:05 UTC
http://bugzilla.kernel.org/show_bug.cgi?id=11811
Comment 64 Maciej Żenczykowski 2008-12-10 21:04:55 UTC
Not the comment I meant to make, rather:

2.6.27.9 isn't even out yet.  2.6.27.8 is the last released version (see http://kernel.org)
Comment 65 Mattzog Bellucci 2008-12-11 15:11:22 UTC
Like I said, I am fairly noob at linux, but uname -r told me "2.6.27-9-generic", which must have a different context than 2.6.27.8.  Please excuse me.  I'll do some research and stop bugging y'all.  Thx anyway.
