Bug 12394

Summary: 2.6.28 and greater: ath5k and p54usb: no association to access point (regression)
Product: Drivers Reporter: Jan Bücken (jb.faq)
Component: network-wireless Assignee: drivers_network-wireless (drivers_network-wireless)
Status: RESOLVED CODE_FIX    
Severity: normal CC: mcgrof, me
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.28 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: strace iwlist wlan0 scan
with debugging enabled and some tests
log of the bisect for the "ap connection bug"

Description Jan Bücken 2009-01-09 09:16:57 UTC
Latest working kernel version:
2.6.27.10 (vanilla) and 2.6.27-gentoo-r7

Earliest failing kernel version:
2.6.28 (vanilla) and 2.6.28-gentoo

Distribution:
Gentoo

Hardware Environment:
amd64, 
Atheros Communications Inc. AR242x 802.11abg Wireless PCI Express Adapter (rev 01)

Software Environment:
ath5k, wpa_supplicant, wireless-tools

Problem Description:
With the ath5k driver in kernel 2.6.28 it is not possible to connect to any access point.
The following messages sometimes appear in the kernel log, but not always:
ath5k: unsupported jumbo
ath5k: can't reset hardware (-11)
ath5k phy0: noise floor calibration timeout (2412) MHz
phy0: failed to restore operational channel after scan

Scanning for access points with "wpa_cli scan && wpa_cli scan_results" works perfectly.

Scanning for access points with "iwlist wlan0 scan" doesn't work:
error message: print_scanning_info: Allocation failed.

Important: With 2.6.27.10, neither of these two problems occurs.

Steps to reproduce:
Update to 2.6.28 kernel and use the ath5k driver...

Hint:
I don't know if this is important:
I have tested this only with WEP and WPA networks. I have not tried to connect to an open network.
Comment 1 Bob Copeland 2009-01-12 07:31:17 UTC
Is SSID hidden?
Comment 2 Jan Bücken 2009-01-13 01:02:56 UTC
(In reply to comment #1)
> Is SSID hidden?
> 
no, 
and I tested it with an open network now, no difference
Comment 3 Bob Copeland 2009-01-13 07:05:19 UTC
Can you post output of 'strace iwlist wlan0 scan' ?
Comment 4 Jan Bücken 2009-01-14 03:31:41 UTC
Created attachment 19786 [details]
strace iwlist wlan0 scan

strace iwlist wlan0 scan &> strace_iwlist_wlan0_scan.

New info: It seems to me that 
iwlist wlan0 scan
print_scanning_info: Allocation failed
is a bug which happens only from time to time, too: 
Sometimes it scans, sometimes it doesn't.

But the "association bug" remains...

greetings
Jan
Comment 5 Bob Copeland 2009-01-14 13:50:10 UTC
Very weird.  Which version of wireless-tools?

Here is where things look broken:
> ioctl(3, SIOCGIWSCAN, 0x7fff75ae7cb0)   = -1 E2BIG (Argument list too long)
> mremap(0x7fc65d2e2000, 134221824, 268439552, MREMAP_MAYMOVE) = 0x7fc64d2e1000
> ioctl(3, SIOCGIWSCAN, 0x7fff75ae7cb0)   = -1 E2BIG (Argument list too long)
[...]

So we're asking for scan results with a 134 *meg* buffer, it fails so we
reallocate with 268 megs.

> mremap(0x7fc5ed2df000, 1073745920, 18446744071562072064, MREMAP_MAYMOVE) = -1
> EFAULT (Bad address)
> mmap(NULL, 18446744071562072064, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)

We keep doubling until we wrap 32-bit int, then it goes negative so you
get ENOMEM.  
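
The sizes in the strace are consistent with a buffer that starts at 2^27 bytes and doubles on every E2BIG. A minimal sketch of that arithmetic follows; the 4096-byte overhead per mapping is an assumption inferred from the numbers in the trace, the function names are made up for illustration, and the wrap is modelled the way a C int typically behaves on two's-complement hardware.

```python
PAGE = 4096  # assumed per-mapping overhead, inferred from the strace sizes


def to_int32(n):
    """Wrap n to a signed 32-bit value, as a C int typically does on overflow."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n


def to_uint64(n):
    """Reinterpret a signed value as the unsigned 64-bit size passed to mmap."""
    return n & 0xFFFFFFFFFFFFFFFF


def mapping_sizes(start=1 << 27, steps=5):
    """Yield (buflen, mapping size) while the buffer keeps doubling."""
    buflen = start
    for _ in range(steps):
        yield buflen, to_uint64(buflen + PAGE)
        buflen = to_int32(buflen * 2)


for buflen, mapped in mapping_sizes():
    print(f"buflen={buflen:>12}  mapping={mapped}")
```

The first, second and fourth mapping sizes match the 134221824, 268439552 and 1073745920 seen above, and the fifth step yields 18446744071562072064, the size that mremap and mmap reject.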

Looking at mac80211 (net/mac80211/scan.c) ieee80211_scan_results, I don't see
right away how you would get -E2BIG with any of those sizes, unless 
ieee80211_scan_result is hosed.  But there weren't major changes to it 
in 2.6.28.

As for association, can you turn on CONFIG_MAC80211_DEBUG_MENU and
CONFIG_MAC80211_VERBOSE_DEBUG then post whatever shows up in 'dmesg'
(if anything) when you try to associate?  
Comment 6 Jan Bücken 2009-01-22 08:04:57 UTC
Created attachment 19938 [details]
with debugging enabled and some tests

Sorry, I was busy with an exam...

(In reply to comment #5)
> Very weird.  Which version of wireless-tools?

I use wireless-tools 29

> As for association, can you turn on CONFIG_MAC80211_DEBUG_MENU and
> CONFIG_MAC80211_VERBOSE_DEBUG then post whatever shows up in 'dmesg'
> (if anything) when you try to associate?  


Steps I took:
0) Enabled debugging (CONFIG_MAC80211_VERBOSE_DEBUG and ath5k) in the 2.6.28 vanilla kernel (not 2.6.28.1).

1) Fresh reboot with this kernel.
Effect: no connection to any access point.

NEW:
2) Disabled the wireless LAN with the keyboard button (it seems to be hardware based; things like rfkill are disabled).

3) Waited some seconds and enabled the WLAN again.
Effect: I get a connection to an access point but lose it shortly afterwards,
especially if I scan with "iwlist wlan0 scan" (sometimes it scans correctly).

4) Repeated steps 2) and 3): it is reproducible.

You can see this in the dmesg output.
Comment 7 Jan Bücken 2009-01-28 06:59:54 UTC
New info: First failing kernel version is the 2.6.28-rc1.

Is it safe to do a (git) bisect between 2.6.27 and 2.6.28-rc1?
I mean, can it damage my hardware if I start my system with such a kernel?
(And there are 3800 patches; is there an easy way?)

New info: I told you in my previous comment that the card connects to an access point if I disable and enable it.
If I start a ping to a website, then the connection doesn't break.
Comment 8 Bob Copeland 2009-01-28 07:11:35 UTC
Bisecting won't hurt your hardware, and really it's the only thing I can think of at the moment.  You can try restricting it to changes in net/ via:

$ git bieect start -- net
Comment 9 Bob Copeland 2009-01-28 07:11:58 UTC
(^typo, should be bisect)
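
On the question of 3800 patches: a bisect halves the candidate range on every step, so the number of kernels to build and test grows only with the logarithm of the commit count. A rough sketch of that estimate follows; the helper name is made up for illustration and the result is approximate, since skipped commits add extra steps.

```python
import math


def bisect_steps(commits):
    """Approximate number of kernels to build and test for `commits` candidates."""
    return math.ceil(math.log2(commits))


print(bisect_steps(3800))
```

For 3800 commits this comes to about 12 builds. Restricting the bisect to a path such as net/ shrinks the candidate set, and with it the step count.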
Comment 10 Bob Copeland 2009-02-09 11:17:40 UTC
Any news on this one?
Comment 11 Jan Bücken 2009-02-09 13:56:46 UTC
(In reply to comment #10)
> Any news on this one?
> 

I'm sorry, I ran into trouble with the bisect, but I'm hanging in there.
This is what I have found out so far:
First, the kernel in the bisect doesn't compile; it fails with

drivers/built-in.o: In function `rtl8169_gset_xmii':
r8169.c:(.text+0x7e4b8): undefined reference to `mii_ethtool_gset'
make: *** [.tmp_vmlinux1] Fehler 1

and the modules do not build.
After skipping several such kernels, I decided to do the bisect without the Realtek 8169 driver and without modules (everything built in).
But after testing 2.6.27 and 2.6.28-rc1 again, I found that neither problem is reproducible every time:
At the university the problems appear more often than at home. At the university there are up to 80 access points, at home up to ten. Maybe this is one reason.
Next problem: After making sure that 2.6.28-rc1 has the bug and 2.6.27 does not, the first kernel between them in the bisect gets a kernel panic at boot. More precisely: I tested it at the university, and the kernel panic appears if and only if I activate the chip.
So far I have skipped some kernels, but all of them get a kernel panic (tested at the university). At home some of them boot and I can test. Maybe it is the same problem: too many access points near the tuning range.
Hence it is tedious to reboot the laptop every time, and so I have to spend more time on it.
Comment 12 Jan Bücken 2009-02-09 14:00:59 UTC
(In reply to comment #8)
> Bisecting won't hurt your hardware, 

Why are you so sure?
http://www.phoronix.com/scan.php?page=news_item&px=Njc0Nw

This can happen every time...
But I'll do the bisect, if there are no more unexpected problems.
Comment 13 Bob Copeland 2009-02-09 16:39:07 UTC
> Why are you so sure?
> http://www.phoronix.com/scan.php?page=news_item&px=Njc0Nw

Well, because there are no known ath5k bugs that brick the device.  If there are any unknown ones, then you might as well hit it using a stable kernel :)  Of course if you have e1000, that's another story.

> This can happen every time...
> But I'll do the bisect, if there are no more unexpected problems.

Actually I believe the issue has to do with large information elements in the scan results, combined with the fact that ath5k exports lots of channels so scans take a considerable time.  This can interrupt normal function of the card.  There are some changes in the pipeline to address some of this.  Though I don't think either of those issues are regressions, so there may be something else.
Comment 14 Jan Bücken 2009-02-18 12:31:21 UTC
1) The connecting problem to access points:
I believe bisecting is not useful, because the bug is not reliably reproducible (at friends' places I get a connection every time) and the behavior of the bug changes during the bisect.
I get this result:

a40c24a13366e324bc0ff8c3bb107db89312c984 is first bad commit
commit a40c24a13366e324bc0ff8c3bb107db89312c984
Author: David S. Miller <davem@davemloft.net>
Date:   Thu Sep 11 04:51:14 2008 -0700

    net: Add SKB DMA mapping helper functions.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 2ab13c7cac689f67d97cb8f7ca42343713c53ca0 15a1e0f81f6e8f7eb7e6659a0f7b6b983eeda420 M      include
:040000 040000 ff3568bfc0848c00927e97f7c6005a7857f9c0af c877f9af828cab1c62785ead7cf3571202ab27a7 M      net
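
Because the bug does not show up on every boot, a single successful connection does not prove that a kernel is good, and one wrong "good" verdict sends the bisect to the wrong commit. A small sketch of how many test boots one bisect step would need follows; it assumes the boots are independent and that the per-boot failure rate is known, and both the function and the rates used are hypothetical, not measured on this hardware.

```python
import math


def trials_needed(p_fail, max_miss=0.05):
    """Boots needed so a truly bad kernel passes every test with probability <= max_miss."""
    if not 0 < p_fail < 1:
        raise ValueError("p_fail must be strictly between 0 and 1")
    return math.ceil(math.log(max_miss) / math.log(1 - p_fail))


# Hypothetical failure rates, for illustration only.
for p in (0.5, 0.3):
    print(f"bug shows up on {p:.0%} of boots -> test {trials_needed(p)} times")
```

With a bug that appears on half of the boots, five clean boots in a row are needed before marking a kernel good at this confidence level; with a rarer bug the count rises quickly.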
Comment 15 Jan Bücken 2009-02-18 12:34:24 UTC
2) For the "iwlist bug" I have to do a second bisect. The bug has split (I get a connection to an access point, but iwlist wlan0 scan fails).

3) I'll test what happens if I revert the commit above in 2.6.28-rc1
Comment 16 Jan Bücken 2009-02-18 12:36:12 UTC
Created attachment 20304 [details]
log of the bisect for the "ap connection bug"

only the log
Comment 17 Jan Bücken 2009-02-26 05:13:47 UTC
> 
> 3) I'll test what happens if I revert the commit above in 2.6.28-rc1
> 

It is not possible to revert this commit in 2.6.28-rc1 (too many dependencies).
After testing this commit again (booting with this kernel), I could connect to an AP. As I said, it is not reproducible all the time.
But: I use wpa_supplicant and I have all networks disabled by default. It seems to me that the faster I enable the network with wpa_cli after a reboot, the more likely I am to connect to an AP.
Comment 18 Jan Bücken 2009-02-26 05:17:35 UTC
(In reply to comment #15)
> 2) For the "iwlist bug" I have to do a second bisect. The bug has split (I get a
> connection to an access point, but iwlist wlan0 scan fails)

I will wait for 2.6.29 now and test both problems then; maybe they are gone.
I will do a new bisect if and only if the problems are still present then.
Comment 19 Jan Bücken 2009-04-01 12:51:23 UTC
both still present with 2.6.29 (gentoo-sources)
Comment 20 Bob Copeland 2009-04-01 14:14:29 UTC
Please post the dmesg of the attempt to associate with AP.  

You can also try this patch in the meantime:

http://marc.info/?l=linux-wireless&m=123841474910111&w=2
Comment 21 Jan Bücken 2009-04-09 19:55:16 UTC
(In reply to comment #20)
> Please post the dmesg of the attempt to associate with AP.  

It shows nothing... (tested with gentoo-sources-2.6.29-r1, CONFIG_MAC80211_DEBUG_MENU and
CONFIG_MAC80211_VERBOSE_DEBUG turned on)

> 
> You can also try this patch in the meantime:
> 
> http://marc.info/?l=linux-wireless&m=123841474910111&w=2

I will do this next
Comment 22 Jan Bücken 2009-04-09 19:59:14 UTC
Oh, I forgot: only wpa_cli repeats "CTRL-EVENT-SCAN-RESULTS" regularly

Happy Easter!
Comment 23 Jan Bücken 2009-06-28 22:19:39 UTC
Today I installed Gentoo on an old desktop system.
I have an external "Siemens Gigaset 54 Usb" adapter.

I installed the 2.6.27-r8 and 2.6.29-r5 kernels (gentoo-sources).

And now an important new piece of information:
I thought this bug was a problem with ath5k, but I have the same bug with p54usb: everything is fine with 2.6.27, but with 2.6.29 wpa_cli does not connect to the AP! (AP with WPA2.)

My friend has a bcm4318 and uses the b43 module. He is not affected by this bug.

Do ath5k and p54usb share any dependencies or code that the b43 module does not have or does not use? Maybe we can narrow down the regression / patch now.

I haven't tested this bug with 2.6.30 yet.
Comment 24 Jan Bücken 2009-07-03 20:46:25 UTC
(In reply to comment #23)

> I haven't tested this bug with 2.6.30 yet.

Tested it on both systems (with gentoo-sources-2.6.30-r1). It seems to be FIXED.