Bug 14402

Summary: Atheros ath9k module is not working with 2.6.31.1 on an Acer Extensa 7630EZ
Product: Platform Specific/Hardware
Reporter: Bernhard (berndl81)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: CLOSED CODE_FIX
Severity: normal
CC: adam, alan-jenkins, ingmar.stdin, lenb, linville, mcgrof, mingo, rjw, sujith, yinghai
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.31.x
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 13615
Attachments: remove duplicate rtc reset
0001-Revert-x86-e820-pci-reserve-extra-free-space-nea.patch
Ram align hack

Description Bernhard 2009-10-14 11:17:55 UTC
This bug was initially reported at the Archlinux Bugtracker (http://bugs.archlinux.org/task/16413)

The Atheros ath9k module is not working with [testing]-kernel 2.6.31.1 on an Acer Extensa 7630EZ. Downgrading to 2.6.30.x solves this problem.

#################################################
#### Log with 2.6.30 Kernel (Wifi is working) ###
#################################################
[bernhard@wallaby ~]$ lsmod |grep ath
ath9k 356404 0
mac80211 211488 1 ath9k
cfg80211 78152 2 ath9k,mac80211
rfkill 13108 4 ath9k,acer_wmi
led_class 5112 2 ath9k,acer_wmi

[bernhard@wallaby ~]$ dmesg |grep ath
ath9k 0000:05:00.0: enabling device (0000 -> 0002)
ath9k 0000:05:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath9k 0000:05:00.0: setting latency timer to 64
phy0: Selected rate control algorithm 'ath9k_rate_control'
Registered led device: ath9k-phy0::radio
Registered led device: ath9k-phy0::assoc
Registered led device: ath9k-phy0::tx
Registered led device: ath9k-phy0::rx
#################################################


#####################################################
#### Log with 2.6.31 Kernel (Wifi is not working) ###
#####################################################
[bernhard@wallaby ~]$ dmesg |grep ath
ath9k 0000:05:00.0: enabling device (0000 -> 0002)
ath9k 0000:05:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath9k 0000:05:00.0: setting latency timer to 64
ath9k 0000:05:00.0: PCI INT A disabled
#####################################################

Steps to reproduce:
When I upgrade to a 2.6.31.x kernel, the wlan0 interface is detected and disabled instantly so I can't do something like iwlist wlan0 scan.

If any further information would be useful, I will happily provide it.
Comment 1 Bernhard 2009-10-14 11:18:42 UTC
I forgot to mention that the loaded modules are the same when running 2.6.30.x and 2.6.31.x kernels.
Comment 2 John W. Linville 2009-10-14 13:40:30 UTC
I suspect this relates to rfkill...there were a couple of acer-wmi rfkill fixes in the 2.6.31, possibly not enough of them...
Comment 3 Bernhard 2009-10-14 13:47:27 UTC
Hi John.

Thx for commenting on this one.

Is there anything I can try so that we can confirm this depends on acer-wmi?
Comment 4 John W. Linville 2009-10-14 14:00:08 UTC
git rev-list v2.6.30..v2.6.31 drivers/platform/x86/acer-wmi.c
ed5c8ef3bb2de277b7885072e0e981c41a022be5
a878417cc576720d3c9ff5399522d06f226bad7d
b3fa1329eaf2a7b97124dacf5b663fd51346ac19
19d337dff95cbf76edd3ad95c0cee2732c3e1ec5
621cac85297de5ba655e3430b007dd2e0da91da6

You could try reverting those patches?  Looks like it won't be a clean set of reverts.  Not sure what else to suggest as a quick test...?
Comment 5 Alan Jenkins 2009-10-14 14:18:44 UTC
I don't think acer-wmi is essential.  If it is doing something wrong, you should be able to prevent it from loading

echo "blacklist acer-wmi" >> /etc/modprobe.d/acer-wmi-blacklist.conf

(or move the module file elsewhere, or disable it in your kernel config...)
Comment 6 Adam 2009-10-14 14:24:55 UTC
I'm having a similar problem with asus_laptop.  At first, I could connect for ~20min, and then it would be killed (still probably rfkill related).

When unable to connect, I get "SIOCSIFFLAGS: Unknown error 132"
Comment 7 Alan Jenkins 2009-10-14 14:33:36 UTC
I don't think it is related to acer-wmi though.  The two fixes in 2.6.31 should cover everything.

I find it very suspicious that the led devices disappear.  They should really not be affected by rfkill, and as far as I can see this is confirmed by the code.

I don't know anything about ath9k or what has been changed recently.  But perhaps you should try CONFIG_ATH9K_DEBUG.  Check the help text, you will need to set a module parameter to get any output.
Comment 8 Alan Jenkins 2009-10-14 14:41:13 UTC
(In reply to comment #6)
> I'm having a similar problem with asus_laptop.  At first, I could connect for
> ~20min, and then it would be killed (still probably rfkill related).
> 
> When unable to connect, I get "SIOCSIFFLAGS: Unknown error 132"

Yes, that is the new RFKILL error code.  However asus-laptop does not expose any rfkill device, so your problem is more likely to lie with the wireless driver.  (asus-laptop does appear to export a custom wireless toggle interface, but I don't see any recent changes there).

Please confirm which kernel versions you are talking about (last known good version, first known bad version).
Comment 9 Bernhard 2009-10-14 14:43:47 UTC
Adam: I think your bug is something different. In my case, the device is ALWAYS immediately disabled again. so no chance to even connect.
Comment 10 John W. Linville 2009-10-14 14:44:36 UTC
Might be worth probing the rfkill subsystem using the rfkill utility:

   http://git.sipsolutions.net/rfkill.git

Re: Unknown error 132 (include/asm-generic/errno.h)

#define ERFKILL         132     /* Operation not possible due to RF-kill */

Eeek...ath9k, not ath5k -- sorry Bob!
Comment 11 Bernhard 2009-10-14 14:52:18 UTC
(In reply to comment #5)
> I don't think acer-wmi is essential.  If it is doing something wrong, you
> should be able to prevent it from loading
> 
> echo "blacklist acer-wmi" >> /etc/modprobe.d/acer-wmi-blacklist.conf
> 
> (or move the module file elsewhere, or disable it in your kernel config...)

I just blacklisted acer-wmi and booted 2.6.31.4

The result is exactly the same as in #1. The device is detected and immediately disabled.
Comment 12 Adam 2009-10-14 15:21:29 UTC
2.6.31.4 from Arch Linux here.

I mentioned asus-laptop since I thought it was similar to acer-wmi.

Some people in the arch forums were saying that the problem seemed to change based on if bluetooth was turned on or off.  If I have bluetooth off, I receive the error every time with wireless (same as Bernhard).  

asus-laptop does expose custom interface toggles which are the ones I use when enabling and disabling radios.
Comment 13 Bernhard 2009-10-14 15:26:26 UTC
for what i know, my laptop (extensa 7630ez) does not even have a bluetooth device which i could disable....
Comment 14 Alan Jenkins 2009-10-14 15:32:58 UTC
(In reply to comment #12)
> 2.6.31.4 from Arch Linux here.
> 
> I mentioned asus-laptop since I thought it was similar to acer-wmi.
> 
> Some people in the arch forums were saying that the problem seemed to change
> based on if bluetooth was turned on or off.  If I have bluetooth off, I
> receive
> the error every time with wireless (same as Bernhard).  
> 
> asus-laptop does expose custom interface toggles which are the ones I use
> when
> enabling and disabling radios.

Right.  That last sounds very much like a bug in asus-laptop, please do report it separately.  Since it's not a core ACPI driver, you're more likely to get a response if you email the maintainer (To: Corentin Chary <corentincj@iksaif.net>, CC: acpi4asus-user@lists.sourceforge.net).

The "20 minute" problem is a bit of a mystery.  I would tend to blame it on ath9k, but I suppose it could be due to some core ACPI change.

In any case, it would be much appreciated if you could open a new bug describing the 20-minute problem instead of commenting here.  It can always be merged back if it is found to be the same problem.
Comment 15 Adam 2009-10-14 15:39:55 UTC
Yep, will do once I get home and have access to my laptop.
Comment 16 John W. Linville 2009-10-14 16:40:10 UTC
Comment 12 suggests a link to bluetooth coexistence?
Comment 17 Luis Chamberlain 2009-10-14 16:42:09 UTC
This might be rfkill related, and is similar to bug report:

http://bugzilla.kernel.org/show_bug.cgi?id=13581


A copy and paste from my comment there, perhaps we should merge the two bug reports?

Note that rfkill was completely rewritten
for the 2.6.31 kernel.

Please try out the new rfkill userspace application to see if you can query the
rfkill status:

http://wireless.kernel.org/en/users/Documentation/rfkill

I think there is support for a command:

rfkill unblock all
Comment 18 Alan Jenkins 2009-10-14 16:51:17 UTC
(In reply to comment #14)
> (In reply to comment #12)
> Right.  That last sounds very much like a bug in asus-laptop, please do
> report
> it separately.  Since it's not a core ACPI driver, you're more likely to get
> a
> response if you email the maintainer (To: Corentin Chary
> <corentincj@iksaif.net>, CC: acpi4asus-user@lists.sourceforge.net).

Please disregard.  Since disabling bluetooth also causes this problem on an EeePC (and therefore with eeepc-laptop), it's pretty unlikely that asus-laptop is the problem.
Comment 19 Adam 2009-10-14 16:55:22 UTC
Ahhh, sorry I always get that mixed up when I'm not at my machine.
Yes, it's an EeePC.
Comment 20 Bernhard 2009-10-14 17:25:50 UTC
i just installed the rfkill usertool (git version) and realized, that when i boot 2.6.31.x wlan was softblocked.

using rfkill unblock all i double checked, that the device was neither soft- nor hardblocked.

rmmod ath9k and modprobe ath9k led to exactly the same error message as in msg #1 (according to dmesg). the device is detected and disabled again, with no chance to bring it up using ifconfig wlan0 up or something like this :(
Comment 21 Luis Chamberlain 2009-10-14 17:52:11 UTC
Odd -- I know nothing of rfkill, but can you try building with rfkill disabled?

Posting your kernel messages and userspace output of rfkill might help here. From what I understand from what you describe you did have your device rfkill'd and then you tried to unblock it using the rfkill userspace app but still were unable to bring the interface up, even after an rmmod/modprobe.

Does your laptop have rfkill buttons? Can you try to lift the blocking state without using the rfkill userspace app? You can use the userspace app to query the state.

How about the BIOS?
Comment 22 Bernhard 2009-10-14 17:57:44 UTC
(In reply to comment #21)
> Odd -- I know nothing of rfkill, but can you try building with rfkill
> disabled?
> 
> Posting your kernel messages and userspace output of rfkill might help here.
> From what I understand from what you describe you did have your device
> rfkill'd
> and then you tried to unblock it using the rfkill userspace app but still
> were
> unable to bring the interface up, even after an rmmod/modprobe.
> 
> Does your laptop have rfkill buttons? Can you try to lift the blocking state
> without using the rfkill userspace app? You can use the userspace app to
> query
> the state.
> 
> How about the BIOS?

1) i do not know anything about rfkill too, unfortunately.
2) yes, exactly. I booted 2.6.31, checked the rfkill state and realized that the wlan device was softblocked. I unblocked it using the rfkill tool and checked that afterwards it was not softblocked and not hardblocked. After that step I tried to get the device up by rmmod and modprobe and got exactly the same kernel-message as I have posted in #1

3) An rfkill button is to disable/enable wlan, right? yes, i do have such a button, but clicking it doesn't change anything when booting 2.6.31. I checked that with rfkill tool. (rfkill list).
Comment 23 Luis Chamberlain 2009-10-14 17:59:41 UTC
Can you try 'rfkill unblock all' after bootup and then try to bring the interface up -- that is, do not rmmod ath9k and modprobe it afterwards.
Comment 24 Bernhard 2009-10-14 18:26:37 UTC
Hi Luis.

Adding to #22: Suddenly the rfkill-button does work in a sense that it changes the softblock-state of wlan according to the output of rfkill list.

But: Installing 2.6.31, rfkill unblock all and ifconfig wlan0 up tells me that wlan0 does not exist (since it was disabled when booting). strange... especially since no one else is experiencing this as it seems...
Comment 25 Luis Chamberlain 2009-10-14 18:34:04 UTC
Can you provide output from dmesg when you try all these things?
Comment 26 Bernhard 2009-10-14 18:37:09 UTC
i did a tail -f /var/log/dmesg but when i ran all these commands, nothing (absolutely nothing) was added to dmesg :(
Comment 27 Luis Chamberlain 2009-10-14 22:45:44 UTC
OK then I'm afraid I cannot be of any more help at this point, as Alan pointed out you may want to communicate this to the platform driver maintainer for your laptop.
Comment 28 Bernhard 2009-10-15 05:22:04 UTC
(In reply to comment #27)
> OK then I'm afraid I cannot be of any more help at this point, as Alan
> pointed
> out you may want to communicate this to the platform driver maintainer for
> your
> laptop.
Ok, thx anyway a lot for trying to help. One question, though. What do you mean by "platform driver maintainer"? Some guy at Acer? Packager at Arch?
Comment 29 Alan Jenkins 2009-10-15 08:23:44 UTC
Bernhard: The "platform driver" means acer-wmi in your case (and the contact addresses could be found in the MAINTAINERS file).  But the evidence in _your_ case should rule it out, because the same problem happened when you blacklisted acer-wmi.  

I think it's just been too confusing in here with both you and Adam + John, me and Luis :-/.

The key is this:

> #####################################################
> #### Log with 2.6.31 Kernel (Wifi is not working) ###
> #####################################################
> [bernhard@wallaby ~]$ dmesg |grep ath
> ath9k 0000:05:00.0: enabling device (0000 -> 0002)
> ath9k 0000:05:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> ath9k 0000:05:00.0: setting latency timer to 64

[working kernels print led messages here, but the non-working kernel does this:]

> ath9k 0000:05:00.0: PCI INT A disabled
> #####################################################

It strongly suggests that the ath9k driver failed to initialize the device.  It didn't get as far as creating the LEDs, before it gave up and disabled the PCI device again.

[Platform rfkill could potentially do evil things to cause that, but we've ruled it out.  And this is _not_ the expected behaviour for ath9k rfkill.  If ath9k sees that the device is rfkilled, it should not fail to create the wlan0 device.  It should create wlan0, but generate the new RFKILL error code (132) on "ifconfig up wlan0".]


As I say, I don't know anything about ath9k or its recent development.  All _I_ can suggest is that you check out CONFIG_ATH9K_DEBUG and the associated module option.  Hopefully that will narrow down where the failure occurs in the initialisation sequence.
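
The comparison above can also be done mechanically. A minimal sketch, assuming the two `dmesg | grep ath` outputs from the description have been saved to files; the temporary files and variable names are illustrative only:

```shell
# Save the two logs from the bug description into temporary files.
good=$(mktemp)
bad=$(mktemp)

cat > "$good" <<'EOF'
ath9k 0000:05:00.0: enabling device (0000 -> 0002)
ath9k 0000:05:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath9k 0000:05:00.0: setting latency timer to 64
phy0: Selected rate control algorithm 'ath9k_rate_control'
Registered led device: ath9k-phy0::radio
Registered led device: ath9k-phy0::assoc
Registered led device: ath9k-phy0::tx
Registered led device: ath9k-phy0::rx
EOF

cat > "$bad" <<'EOF'
ath9k 0000:05:00.0: enabling device (0000 -> 0002)
ath9k 0000:05:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath9k 0000:05:00.0: setting latency timer to 64
ath9k 0000:05:00.0: PCI INT A disabled
EOF

# Lines that appear only in the working (2.6.30) log.
only_good=$(grep -vxFf "$bad" "$good")
# Lines that appear only in the failing (2.6.31) log.
only_bad=$(grep -vxFf "$good" "$bad")

echo "missing on 2.6.31:"
echo "$only_good"
echo "new on 2.6.31:"
echo "$only_bad"

rm -f "$good" "$bad"
```

The only line unique to the failing log is the "PCI INT A disabled" message, while the rate-control and LED registration lines are unique to the working one, which is the pattern pointed out in this comment.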
Comment 30 Luis Chamberlain 2009-10-15 16:05:30 UTC
Thanks Alan -- I missed the

PCI INT A disabled

Odd, given that ath_pci_probe() prints its error messages using KERN_ERR, so you should see them. CONFIG_ATH9K_DEBUG will help only if you at least get past device initialization; in this case that is not even reached, so it won't help. Although we do use KERN_ERR for errors on probe, to be safe set your default kernel log level so that everything is printed:

dmesg -n 8

If you don't get much out you could add printk's to the pci probe to see at least where it reached.

Another idea is to try the latest ath9k, either directly from wireless-testing [1], which requires a whole kernel compile, or from compat-wireless [2], which only requires the wireless modules to be compiled. compat-wireless is based on wireless-testing so it's bleeding edge. If you want to try what is on 2.6.32 you could try the compat-wireless stable release as well [3].

[1] http://wireless.kernel.org/en/developers/Documentation#wireless-testing.git
[2] http://wireless.kernel.org/en/users/Download
[3] http://wireless.kernel.org/en/users/Download/stable
Comment 31 Alan Jenkins 2009-10-15 16:09:02 UTC
Unfortunately there's one failure that doesn't have a printk in v2.6.31.1.  I think it's the only one, so at least it narrows it down :-).

	if (ath_attach(id->device, sc) != 0) {
		ret = -ENODEV;
		goto bad3;
	}
Comment 32 Alan Jenkins 2009-10-15 16:13:20 UTC
And there are a fair number of DPRINTF's under ath_attach.  I assume they will show up in dmesg if one enables CONFIG_ATH9K_DEBUG & do "modprobe -r ath9k; modprobe ath9k debug=0xffffffff".
Comment 33 Luis Chamberlain 2009-10-15 17:06:49 UTC
Heh good catch -- I was looking at the wireless-testing code, and there ath_pci_probe() always prints something out upon failure. Only in recent kernels do we propagate the exact error during hardware initialization, so it would be good for Bernhard to try a later version to see where the hardware initialization fails (assuming that is where this hits).

Using wireless-testing or compat-wireless would be good.

In bleeding edge you actually should no longer need to enable CONFIG_ATH9K_DEBUG to get debug prints, but it's still a good idea. Either way, as Alan suggests, please make sure to use:

modprobe ath9k debug=0xffffffff

On compat-wireless you can enable CONFIG_ATH9K_DEBUG by setting it in config.mk; it's already there, just remove the # that currently comments it out.
Comment 34 Bernhard 2009-10-15 17:24:29 UTC
Hi everybody.

I'll try to compile compat-wireless and will report back if I succeed.
Comment 35 Bernhard 2009-10-18 06:46:09 UTC
I managed neither to compile and run compat-wireless nor to compile a custom kernel with CONFIG_ATH9K_DEBUG enabled.

However, perhaps this is helpful. I realized that if I boot any 2.6.31 kernel with acpi=off, everything works fine. wlan0 device is created and I can connect just fine.
Comment 36 Luis Chamberlain 2009-10-26 19:46:17 UTC
Please report back -- I'd like to see the output of loading ath9k with debugging completely enabled on the latest ath9k driver.

What issues are you having?
Comment 37 Bernhard 2009-10-26 20:02:31 UTC
Hi Luis.

First off, the situation is still the same with any 2.6.31.x kernel as described in #1.

Unfortunately, I don't have any clue how to compile a kernel with debugging enabled. I am sorry but I have not yet compiled any kernel by myself and I am pretty helpless in this case.

What I still experience (also with 2.6.31.5) is that the device is enabled and working if I boot with acpi=off (if this is of any help).

The debug-messages in this case are the same as for the (working) 2.6.30.x kernels.
Comment 38 Luis Chamberlain 2009-10-26 20:19:35 UTC
You can easily enable ath9k debugging if you just use compat-wireless and edit config.mk to enable CONFIG_ATH9K_DEBUG.

http://wireless.kernel.org/en/users/Download

That is, look for ATH9K in config.mk and make sure this is set:

CONFIG_ATH9K=m
CONFIG_ATH9K_DEBUG=y

Then run:

make
sudo make install


reboot
Comment 39 Luis Chamberlain 2009-10-26 20:30:37 UTC
Oh you also need CONFIG_ATH_DEBUG enabled
Comment 40 Bernhard 2009-10-27 14:11:09 UTC
Hi Luis. I compiled compat-wireless (the latest) and managed to get the following output:

1) relevant parts of dmesg after make unload && modprobe ath9k
http://pastebin.com/f2b5d59f7

2) after rebooting the laptop with the latest drivers
http://pastebin.com/f26cfc51b

I hope this is of any help.

Bernhard
Comment 41 Luis Chamberlain 2009-10-27 14:34:41 UTC
What changes did you make to config.mk ? Can you paste the diff here.
Comment 42 Bernhard 2009-10-27 15:11:27 UTC
I ran ./scripts/driver-select ath9k and then just uncommented CONFIG_ATH9K_DEBUG=y in config.mk. That's the only change I made.

However, I just now realized that I should have uncommented CONFIG_ATH_DEBUG. I will try that now and report back.
Comment 43 Luis Chamberlain 2009-10-27 15:12:59 UTC
Yeah you also need CONFIG_ATH_DEBUG.
Comment 44 Bernhard 2009-10-27 15:44:13 UTC
ok. I seem to have another problem. I did the following

1) Installed kernel 2.6.31.5 from arch-repositories and rebooted (wlan device not detected as in #1)
2) downloaded and extracted bleeding edge compat-wireless
3) ran ./scripts/driver-select ath9k
4) make
5) changed config.mk (diff to original: http://pastebin.com/f52bb030)
6) make install

-> here something strange happened. I noticed two warning messages (about ath_print not being defined in ath9k_hw.ko and ath9k.ko) but the install went through
7) make unload
8) modprobe ath9k -> dmesg-output: http://pastebin.com/f61a00fab

9) rebooted: -> dmesg-output after reboot: http://pastebin.com/f218d1390

so it seems that due to enabling CONFIG_ATH_DEBUG all wlan-specific output is gone from dmesg....
Comment 45 Luis Chamberlain 2009-10-27 15:57:07 UTC
4) make
5) changed config.mk (diff to original: http://pastebin.com/f52bb030)
6) make install


This order is incorrect. You should do it in this order:

4) changed config.mk (diff to original: http://pastebin.com/f52bb030)
5) make
6) make install
Comment 46 Bernhard 2009-10-27 16:03:04 UTC
sorry, I made an error copy and pasting. Of course I edited config.mk before make && make install.

sorry for the confusion.
Comment 47 Luis Chamberlain 2009-10-27 16:09:26 UTC
So ath_print comes from ath.ko -- can you run

modprobe -l ath

What do you get?

We need to ensure you are building ath.ko on compat-wireless, you can ensure this by having:

CONFIG_ATH_COMMON=m
CONFIG_ATH_DEBUG=y
Comment 48 Bernhard 2009-10-27 17:07:26 UTC
Hi,

I have now enabled everything suggested here:
CONFIG_ATH9K=m
CONFIG_ATH9K_DEBUG=y
CONFIG_ATH_COMMON=m
CONFIG_ATH_DEBUG=y

While doing make (building modules, stage 2), the following warnings occurred:

WARNING: "ath_print" [/home/bernhard/compat/compat-wireless-2009-10-27/drivers/net/wireless/ath/ath9k/ath9k_hw.ko] undefined!
WARNING: "ath_print" [/home/bernhard/compat/compat-wireless-2009-10-27/drivers/net/wireless/ath/ath9k/ath9k.ko] undefined!

After make install I unloaded the modules

[bernhard@wallaby compat-wireless-2009-10-27]$ sudo make unload
Unloading ath...
Unloading mac80211...

which seemed to work fine, however modprobing ath9k resulted in:

[bernhard@wallaby compat-wireless-2009-10-27]$ sudo modprobe ath9k
FATAL: Error inserting ath9k (/lib/modules/2.6.31-ARCH/updates/drivers/net/wireless/ath/ath9k/ath9k.ko): Unknown symbol in module, or unknown parameter (see dmesg)

The corresponding dmesg-output is:

dmesg:
cfg80211: Calling CRDA to update world regulatory domain
ath9k_hw: Unknown symbol ath_print

To make sure ath is built on compat-wireless:

[bernhard@wallaby compat-wireless-2009-10-27]$ sudo modprobe -l ath
updates/drivers/net/wireless/ath/ath.ko

after reboot I find (only) the following line in dmesg relating to ath:
[bernhard@wallaby ~]$ dmesg |grep ath
ath9k_hw: Unknown symbol ath_print

The device is still not created.

Thx for your interest in this bug and guiding me through all this stuff btw!
Bernhard
Comment 49 Luis Chamberlain 2009-10-27 17:48:30 UTC
This is just odd. This is from ath/debug.h

#ifdef CONFIG_ATH_DEBUG
void ath_print(struct ath_common *common, int dbg_mask, const char *fmt, ...);
#else
static inline void ath_print(struct ath_common *common,
                             int dbg_mask,
                             const char *fmt, ...)
{
}
#endif /* CONFIG_ATH_DEBUG */


Then the ath/Makefile has:

ath-$(CONFIG_ATH_DEBUG) += debug.o

and ath/debug.c has the ath_print implemented and exported:

void ath_print(struct ath_common *common, int dbg_mask, const char *fmt, ...)
{
        va_list args;

        if (likely(!(common->debug_mask & dbg_mask)))
                return;

        va_start(args, fmt);
        printk(KERN_DEBUG "ath: ");
        vprintk(fmt, args);
        va_end(args);
}
EXPORT_SYMBOL(ath_print);

I suspect you might be doing something wrong but I cannot tell what.

Do you also have CONFIG_ATH9K_HW=m  ?
Comment 50 Bernhard 2009-10-27 17:53:18 UTC
Hi Luis. This is the config.mk file I have used

http://pastebin.com/f441d160e

I downloaded compat-wireless from here
http://wireless.kernel.org/download/compat-wireless-2.6/compat-wireless-2.6.tar.bz2
Comment 51 Luis Chamberlain 2009-10-27 18:09:10 UTC
Well ok but you ran ./scripts/driver-select ath9k

I am thinking I never tested this with debug so its possible that the ath/Makefile gets fuckered up for debug.

Do me a favor and download that tarball again, and do not run driver-select; instead just make sure you uncomment the config options for debugging as discussed. Then compile everything (make) and install.
Comment 52 Bernhard 2009-10-27 18:51:25 UTC
Ok, I did as you suggested getting the following dmesg-output after unloading and re-modprobing ath9k:

-------------------

ath9k: Driver unloaded
cfg80211: Calling CRDA to update world regulatory domain
ath9k 0000:05:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
ath9k 0000:05:00.0: setting latency timer to 64
ath: timeout (100000 us) on reg 0x7044: 0xdccebfdb & 0x0000000f != 0x00000002
ath: Couldn't reset chip
ath: Unable to initialize hardware; initialization status: -5
ath9k 0000:05:00.0: failed to initialize device
ath9k 0000:05:00.0: PCI INT A disabled
ath9k: probe of 0000:05:00.0 failed with error -5
Comment 53 Luis Chamberlain 2009-10-27 19:16:22 UTC
And this used to work on 2.6.30, so a regression must obviously have been introduced. Can you do a git bisect between the two kernel releases? It'll take a while but it would help. At this point I don't see anything obvious, and since you mentioned that booting with acpi=off makes the device work, this might not even be related to the driver but to some change in ACPI.

I think a bisect between 2.6.30 and 2.6.31 would therefore be very useful and valuable.
Comment 54 Luis Chamberlain 2009-10-27 19:18:01 UTC
Adding Len Brown -- perhaps he might be aware of some ACPI regression on 2.6.31 pending resolution.
Comment 55 Luis Chamberlain 2009-10-27 19:21:39 UTC
What's your lspci output for the Atheros device?
Comment 56 Luis Chamberlain 2009-10-27 19:29:17 UTC
Adding some notes on the ath9k front -- this seems to fail at the very first register write to the hardware. The first write to the hardware happens when we try to initialize the hardware through ath9k_hw_init(). The first call to access the hardware is done in that routine as follows:

        if (!ath9k_hw_set_reset_reg(ah, ATH9K_RESET_POWER_ON)) {
                ath_print(common, ATH_DBG_FATAL,
                          "Couldn't reset chip\n");
                return -EIO;
        }

This is small enough so I'll just paste the code here:

static bool ath9k_hw_set_reset_reg(struct ath_hw *ah, u32 type)
{
        REG_WRITE(ah, AR_RTC_FORCE_WAKE,
                  AR_RTC_FORCE_WAKE_EN | AR_RTC_FORCE_WAKE_ON_INT);

        switch (type) {
        case ATH9K_RESET_POWER_ON:
                return ath9k_hw_set_reset_power_on(ah);
        case ATH9K_RESET_WARM:
        case ATH9K_RESET_COLD:
                return ath9k_hw_set_reset(ah, type);
        default:
                return false;
        }
}

static bool ath9k_hw_set_reset_power_on(struct ath_hw *ah)
{
        REG_WRITE(ah, AR_RTC_FORCE_WAKE, AR_RTC_FORCE_WAKE_EN |
                  AR_RTC_FORCE_WAKE_ON_INT);

        if (!AR_SREV_9100(ah))
                REG_WRITE(ah, AR_RC, AR_RC_AHB);

        REG_WRITE(ah, AR_RTC_RESET, 0);
        udelay(2);

        if (!AR_SREV_9100(ah))
                REG_WRITE(ah, AR_RC, 0);

        REG_WRITE(ah, AR_RTC_RESET, 1);

        if (!ath9k_hw_wait(ah,
                           AR_RTC_STATUS,
                           AR_RTC_STATUS_M,
                           AR_RTC_STATUS_ON,
                           AH_WAIT_TIMEOUT)) {
                ath_print(ath9k_hw_common(ah), ATH_DBG_RESET,
                          "RTC not waking up\n");
                return false;
        }

        ath9k_hw_read_revisions(ah);

        return ath9k_hw_set_reset(ah, ATH9K_RESET_WARM);
}


So the issue comes from the RTC_STATUS not coming back happy. I do see one thing worth trying... I'll post a patch for that.
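
To make the failure concrete, the timeout line from comment 52 can be decoded by hand. A small sketch, assuming the log line has the form "value read & mask != expected value", and that the mask 0x0000000f and the expected value 0x00000002 correspond to the AR_RTC_STATUS_M and AR_RTC_STATUS_ON arguments of the ath9k_hw_wait() call quoted above; the variable names are illustrative, not taken from the driver:

```shell
# Values copied from the log line:
#   ath: timeout (100000 us) on reg 0x7044: 0xdccebfdb & 0x0000000f != 0x00000002
reg_val=0xdccebfdb
mask=0x0000000f
expected=0x00000002

# Apply the mask the same way the wait loop does before comparing.
masked=$(( reg_val & mask ))
wanted=$(( expected ))

printf 'masked status: 0x%x, wanted: 0x%x\n' "$masked" "$wanted"

if [ "$masked" -eq "$wanted" ]; then
    echo "RTC reports ON"
else
    echo "RTC never reported ON within the timeout"
fi
```

With these values the masked status is 0xb rather than 0x2, so the wait loop gives up and the reset is reported as failed.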
Comment 57 Luis Chamberlain 2009-10-27 19:30:37 UTC
Created attachment 23551 [details]
remove duplicate rtc reset

This removes the duplicate rtc reset setting. Please give this a try on either compat-wireless or on wireless-testing.
Comment 58 Bernhard 2009-10-27 19:40:00 UTC
(In reply to comment #55)
> What's your lspci output for the Atheros device?

[bernhard@wallaby ~]$ lspci |grep Network
05:00.0 Network controller: Atheros Communications Inc. AR928X Wireless Network Adapter (PCI-Express) (rev 01)
Comment 59 Bernhard 2009-10-27 20:13:18 UTC
(In reply to comment #57)
> Created an attachment (id=23551) [details]
> remove duplicate rtc reset
> 
> This removes the duplicate rtc reset setting. Please give this a try on
> either
> compat-wireless or on wireless-testing.

I just tested with your patch applied and getting exactly the same output as before in #52. The device is still not initialized unfortunately.
Comment 60 Luis Chamberlain 2009-10-27 20:20:02 UTC
Then yeah I cannot see what could be bad with the driver code at this point as that is a very basic initial driver write settings to hardware. Something is preventing the write to the card to be issued correctly.

I'd recommend a bisect.
Comment 61 Bernhard 2009-10-28 05:44:36 UTC
Hm, from what I have now read about git-bisect: do I understand correctly that after each step the kernel (with the source depending on the bisect step) needs to be compiled?
Comment 62 Ingmar Vanhassel 2009-10-28 07:41:34 UTC
(In reply to comment #61)
> Hm, from what I read now about git-bisect. Do I understand it correctly that
> after each step the kernel (with the source depending on the bitsect-step)
> needs to be compiled?

Compiled & tested during each step, and tell git-bisect whether you're seeing the bad behaviour or not so it can decide what the next step should be.
Comment 63 Bernhard 2009-10-30 12:38:13 UTC
Just for info:

On my way to bisect the kernel to find the problematic revision, I tried to find a minimal kernel-config and somehow managed to compile my first kernel, 2.6.32-rc5. Surprisingly, the wireless device now works out of the box again.

Is it safe to assume that - since I have already tested the device with compat-wireless under 2.6.31 which didn't make the device work - another part of the kernel introduced the regression?

If so, how should I proceed? Still trying to bisect or just report this bug elsewhere?
Comment 64 Bernhard 2009-11-02 09:51:47 UTC
Ok, finally I managed to do the bisect, sorry for the delay

bisect-log output: http://pastebin.com/f53977eda  
first bad commit:  http://pastebin.com/f376cdb50

Bernhard
Comment 65 Luis Chamberlain 2009-11-02 16:21:25 UTC
Whoa, interesting, can you now just try to use wireless-testing and revert that single patch and see if indeed reverting it fixes your issue?
Comment 66 Luis Chamberlain 2009-11-02 16:23:26 UTC
Well you don't need to use wireless-testing; whatever git tree you use, just revert that one commit sha1 and see if that indeed fixes the issue. And please confirm that with that commit in place you do get the issue you describe in this bug report (ath9k probe fails due to the RTC reset failing, which is the first thing we do for ath9k).
Comment 67 Bernhard 2009-11-02 16:28:20 UTC
I will for sure do this if anyone can tell me how to revert a single patch....
Comment 68 Luis Chamberlain 2009-11-02 16:32:34 UTC
Sure, in your case its:

git revert 45fbe3ee01b8e463b28c2751b5dcc0cbdc142d90

But please be very sure you are done with the bisect.

git bisect reset
Comment 69 Bernhard 2009-11-03 15:23:52 UTC
it seems that I am too stupid to revert the patch. I checked out the latest 2.6.31.x kernel and then did a

git revert 45fbe3ee01b8e463b28c2751b5dcc0cbdc142d90

and ended with

[bernhard@wallaby linux-2.6.31.y]$ git revert 45fbe3ee01b8e463b28c2751b5dcc0cbdc142d90
warning: too many files (created: 1571 deleted: 336), skipping inexact rename detection
Automatic revert failed.  After resolving the conflicts,
mark the corrected paths with 'git add <paths>' or 'git rm <paths>' and commit the result.

now, I absolutely have no clue how to resolve the conflict and provide what you asked for....
Comment 70 Luis Chamberlain 2009-11-03 15:34:25 UTC
git reset --hard origin

Then compile and confirm your issue is present. Then

git revert 45fbe3ee01b8e463b28c2751b5dcc0cbdc142d90

Compile and verify your issue is gone.
Comment 71 Bernhard 2009-11-03 16:06:08 UTC
Compiling 2.6.31.5 (after git reset) results in the device not being detected. However, I am unable to build the kernel after the revert; make fails with the following error (error messages translated from German)...

CC      arch/x86/kernel/e820.o
arch/x86/kernel/e820.c:1368: error: expected identifier or '(' before '<<' token
arch/x86/kernel/e820.c:1388: error: expected identifier or '(' before '==' token
arch/x86/kernel/e820.c:1389:9: error: invalid suffix "fbe3e..." on integer constant
arch/x86/kernel/e820.c:1423:9: error: invalid suffix "fbe3e..." on integer constant
make[2]: *** [arch/x86/kernel/e820.o] Error 1
make[1]: *** [arch/x86/kernel] Error 2
make: *** [arch/x86] Error 2

I think the problem is that I cannot properly revert the patch with the error msg given in #69.
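
The errors above are what a compiler reports when git's conflict markers (`<<<<<<<`, `=======`, `>>>>>>> 45fbe3e...`) are still in the file after a failed revert. A minimal sketch for checking a file before building, assuming bash and a grep with `-E` support; the function name `has_conflict_markers` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 if the given file still contains
# git conflict markers, non-zero otherwise.
has_conflict_markers() {
    # Markers are exactly seven characters at the start of a line:
    # "<<<<<<< ", "=======" on its own, or ">>>>>>> ".
    grep -Eq '^(<{7} |={7}$|>{7} )' "$1"
}
```

For example, `has_conflict_markers arch/x86/kernel/e820.c && echo "unresolved conflicts"` would flag the file before make is run.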
Comment 72 Luis Chamberlain 2009-11-03 16:19:17 UTC
That patch is very small, it should not cause so many issues. Most likely you never git bisect reset properly so a revert during a rebase (bisect) could have confused git.

Let me try to see if I run into the same compile issue as you do with a revert of the same patch.
Comment 73 John W. Linville 2009-11-03 16:20:28 UTC
Created attachment 23638 [details]
0001-Revert-x86-e820-pci-reserve-extra-free-space-nea.patch

Try applying this instead of doing the 'git revert...' yourself...
Comment 74 Luis Chamberlain 2009-11-03 16:31:06 UTC
Don't forget to

git reset --hard origin

before applying
Comment 75 Luis Chamberlain 2009-11-03 17:38:32 UTC
OK, I see the revert issue you were describing. The problem is that there was a patch after that one that touched the same code area, so reverting just that commit sha1 is a bit trickier.

Trying to see if I can resolve the conflicts manually.
Comment 76 Luis Chamberlain 2009-11-03 17:39:29 UTC
Oh wait, John, you resolved that, didn't you?
Comment 77 Luis Chamberlain 2009-11-03 17:46:49 UTC
Indeed, sorry for the confusion, please just:

git reset --hard origin

and apply John's patch. If that does fix your issue then please try to apply it then to ensure you do get the issue *with* the patch applied.
Comment 78 Luis Chamberlain 2009-11-03 18:02:28 UTC
Bernhard, sorry for the trouble, but it seems I gave you instructions that probably did not get you a pristine 2.6.31.5 either. When you run:

git reset --hard origin

it will checkout a fresh tree from origin/master which if you're using Linus' tree today is based on 2.6.32 and not 2.6.31. So to get your 2.6.31 please trash your older branch and create a fresh one based on 2.6.31.5.

What tree are you using?

To be sure you can check the top level Makefile to ensure you are on 2.6.31.5

Sorry for any confusion.
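
Checking the top-level Makefile can be scripted. A minimal sketch, assuming the 2.6-era layout where `VERSION`, `PATCHLEVEL`, `SUBLEVEL` and `EXTRAVERSION` are assigned with ` = ` near the top of the file; the function name is hypothetical. Running `make kernelversion` in the tree should report the same string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the kernel version assembled from the
# version variables in a top-level kernel Makefile.
kernel_version_from_makefile() {
    awk -F' *= *' '
        $1 == "VERSION"      { v = $2 }
        $1 == "PATCHLEVEL"   { p = $2 }
        $1 == "SUBLEVEL"     { s = $2 }
        $1 == "EXTRAVERSION" { e = $2 }
        END { printf "%s.%s.%s%s\n", v, p, s, e }
    ' "$1"
}
```

Called as `kernel_version_from_makefile Makefile` from the top of the tree, this should print 2.6.31.5 on the branch wanted here.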
Comment 79 Bernhard 2009-11-03 19:10:40 UTC
First off, I checked out the latest 2.6.31.y tree, made the kernel and faced the problem described in #1. Then I applied the patch (thx john), made the kernel again and the problem was not solved (still no wlan interface).

BUT: What came to my mind. Whenever I face the problem that the wlan device is disabled, I get a warning about a failed phy ?? from my ethernet device (tg3) which is according to lspci a 

09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)

With any working kernel (2.6.30.x, 2.6.32-rc5 for example) I don't get this warning. Perhaps this is the reason....
Comment 80 Luis Chamberlain 2009-11-03 19:26:41 UTC
Well you can compile your kernel without that driver to verify your theory, or maybe disable it on the BIOS.

Anyway -- you did a git bisect and managed to find a culprit commit, but it did not seem to lead to a proper culprit patch. I noticed you did a git bisect skip in your history; do you recall doing this, or was this automatically done somehow?

Let's try to take advantage of your git bisect history though.

Your latest good sha1 is 134cbf35c739bf89c51fd975a33a6b87507482c4, which is right after 2.6.30-rc5. Can you double check that this is a good commit with:

git checkout -b x86-mm-patch-01 134cbf35c739bf89c51fd975a33a6b87507482c4

and compiling and loading that kernel. If that is good then it's probably the end limit of what we need to test, and if your git bisect log is correct the next bad is 5d423ccd7ba4285f1084e91b26805e1d0ae978ed, so please test to confirm that is bad with:

git checkout -b x86-mm-patch-03 5d423ccd7ba4285f1084e91b26805e1d0ae978ed

Test that to confirm it is bad. If so then we have only 1 patch to test:

git checkout -b x86-mm-patch-02 45fbe3ee01b8e463b28c2751b5dcc0cbdc142d90

Test that..

So if you do have some time to test please test those sha1sums. Hopefully one of them will be good and one of them bad.
Comment 81 Bernhard 2009-11-03 20:18:11 UTC
(In reply to comment #80)
> Well you can compile your kernel without that driver to verify your theory,
> or
> maybe disable it on the BIOS.

I tried this, it seems that tg3 is not the reason for the observed behaviour since I compiled the kernel completely without ethernet support and experienced the same behaviour.

> Lets try to take advantage of your git bisect history though.
> Your latest good sha1 is 134cbf35c739bf89c51fd975a33a6b87507482c4 which is
> right after 2.6.30-rc5. Can you ensure double check this is a good commit
> with:
> 
> git checkout -b x86-mm-patch-01 134cbf35c739bf89c51fd975a33a6b87507482c4
> 
> and compiling and loading that kernel. 
I just booted this kernel and can confirm that this is indeed a good one. Will now compile the remaining sha's and report back...
Comment 82 Bernhard 2009-11-03 20:30:30 UTC
(In reply to comment #80)
> ... if your git bisect log is correct the
> next bad is 5d423ccd7ba4285f1084e91b26805e1d0ae978ed so please test to
> confirm
> that is bad with:
> git checkout -b x86-mm-patch-03 5d423ccd7ba4285f1084e91b26805e1d0ae978ed
> 
> Test that to confirm it is bad. 

Yes, this one is bad.

> If so then we have only 1 patches to test:
> 
> git checkout -b x86-mm-patch-02 45fbe3ee01b8e463b28c2751b5dcc0cbdc142d90
> 
> Test that..
This one is good.

 
> So if you do have some time to test please test those sha1sums. Hopefully one
> of them will be good and one of them bad.
hope is the poor man's bread :-)
Comment 83 Luis Chamberlain 2009-11-03 22:05:30 UTC
OK great, now please checkout a clean 2.6.31.5 branch:

git checkout -b linux-2.6.31.y v2.6.31.5

compile and load and confirm that kernel has an issue.

Then run:

git revert 5d423ccd7ba4285f1084e91b26805e1d0ae978ed

compile and load and confirm the issue is resolved.
Comment 84 Bernhard 2009-11-04 14:18:39 UTC
(In reply to comment #83)
> OK great, now please checkout a clean 2.6.31.5 branch:
> 
> git checkout -b linux-2.6.31.y v2.6.31.5
> 
> compile and load and confirm that kernel has an issue.

confirmed. wlan-device is disabled on boot

> Then run:
> 
> git revert 5d423ccd7ba4285f1084e91b26805e1d0ae978ed
> 
> compile and load and confirm the issue is resolved.
confirm, issue is resolved by reverting this commit.

bernhard
Comment 85 Luis Chamberlain 2009-11-05 00:15:02 UTC
Created attachment 23657 [details]
Ram align hack

Bernhard, get a clean 2.6.31.5 branch and then apply the attached patch with:

git am ram-align-hack.patch

Compile and see if that also resolves your woes with ath9k.
Comment 86 Luis Chamberlain 2009-11-05 01:41:33 UTC
Also, please post the full log (dmesg > log.txt) of the bootup with the new patch applied.
Comment 87 Luis Chamberlain 2009-11-05 15:36:50 UTC
Bernhard, also please append "debug" to the kernel parameters line. On grub this would be editing /boot/grub/menu.lst and, for your specific kernel line, adding debug.
Comment 88 Yinghai Lu 2009-11-05 18:47:36 UTC
(In reply to comment #87)
> Bernhard also please attach "debug" the the kernel parameters line. On grub
> this would be editing /boot/grub/menu.lst and for your specific kernel line
> add
> debug.

you may need 

CONFIG_PCI_DEBUG=y
Comment 89 Yinghai Lu 2009-11-05 18:56:13 UTC
from 2.6.30


BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
 BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000b5aa1000 (usable)
 BIOS-e820: 00000000b5aa1000 - 00000000b5aa7000 (reserved)
 BIOS-e820: 00000000b5aa7000 - 00000000b5bba000 (usable)
 BIOS-e820: 00000000b5bba000 - 00000000b5c0f000 (reserved)
 BIOS-e820: 00000000b5c0f000 - 00000000b5d08000 (usable)
 BIOS-e820: 00000000b5d08000 - 00000000b5f0f000 (reserved)
 BIOS-e820: 00000000b5f0f000 - 00000000b5f19000 (usable)
 BIOS-e820: 00000000b5f19000 - 00000000b5f1f000 (reserved)
 BIOS-e820: 00000000b5f1f000 - 00000000b5f64000 (usable)
 BIOS-e820: 00000000b5f64000 - 00000000b5f9f000 (ACPI NVS)
 BIOS-e820: 00000000b5f9f000 - 00000000b5fe2000 (usable)
 BIOS-e820: 00000000b5fe2000 - 00000000b5fff000 (ACPI data)
 BIOS-e820: 00000000b5fff000 - 00000000b6000000 (usable)
 BIOS-e820: 0000000100000000 - 0000000140000000 (usable)

Allocating PCI resources starting at b8000000 (gap: b6000000:4a000000)

pci 0000:00:1f.3: reg 10 64bit mmio: [0x000000-0x0000ff]
pci 0000:00:1f.3: reg 20 io port: [0x1c00-0x1c1f]
pci 0000:00:1c.0: bridge io port: [0x00-0xfff]
pci 0000:00:1c.0: bridge 32bit mmio: [0x000000-0x0fffff]
pci 0000:00:1c.0: bridge 64bit mmio pref: [0x000000-0x0fffff]
pci 0000:05:00.0: reg 10 64bit mmio: [0x000000-0x00ffff]
pci 0000:05:00.0: supports D1
pci 0000:05:00.0: PME# supported from D0 D1 D3hot
pci 0000:05:00.0: PME# disabled
pci 0000:00:1c.1: bridge io port: [0x00-0xfff]
pci 0000:00:1c.1: bridge 32bit mmio: [0x000000-0x0fffff]
pci 0000:00:1c.1: bridge 64bit mmio pref: [0x000000-0x0fffff]
pci 0000:00:1c.2: bridge io port: [0x00-0xfff]
pci 0000:00:1c.2: bridge 32bit mmio: [0x000000-0x0fffff]
pci 0000:00:1c.2: bridge 64bit mmio pref: [0x000000-0x0fffff]
pci 0000:09:00.0: reg 10 64bit mmio: [0x000000-0x00ffff]
pci 0000:09:00.0: PME# supported from D3hot D3cold
pci 0000:09:00.0: PME# disabled
pci 0000:00:1c.5: bridge io port: [0x00-0xfff]
pci 0000:00:1c.5: bridge 32bit mmio: [0x000000-0x0fffff]
pci 0000:00:1c.5: bridge 64bit mmio pref: [0x000000-0x0fffff]

The patch that increases the alignment to 64M should work around the problem.

The notebook looks like it is from the same vendor as in the sky2 driver bug?

Looks like they are using [0xb6000000, 0xb8000000) for video RAM?
We may need to find a way to read that range from the hardware.
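
The `starting at b8000000` value in the 2.6.30 log can be reproduced from the logged gap (`b6000000:4a000000`). A minimal sketch, assuming bash with 64-bit arithmetic and assuming the 2.6.30 rounding rule in `e820_setup_gap()` was: start at 1 MB, double while one sixteenth of the gap size is still larger, then round the gap start up past the next boundary. The function names are hypothetical, and whether the attached "Ram align hack" does exactly the 64M round-up shown by `align_up` is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the 2.6.30-style rounding (an assumption,
# not copied from the kernel source). Arguments: gap start, gap size.
pci_mem_start_2630() {
    local gapstart=$(( $1 )) gapsize=$(( $2 ))
    local round=$(( 0x100000 ))            # start with 1 MB
    while (( (gapsize >> 4) > round )); do
        round=$(( round * 2 ))             # double until >= gapsize / 16
    done
    # Two's complement trick: -round is a mask clearing the low bits.
    printf '%x\n' $(( (gapstart + round) & -round ))
}

# Plain round-up of an address to a power-of-two alignment.
align_up() {
    local addr=$(( $1 )) align=$(( $2 ))
    printf '%x\n' $(( (addr + align - 1) & -align ))
}
```

With the values from the log, `pci_mem_start_2630 0xb6000000 0x4a000000` prints b8000000, and `align_up 0xb6000000 0x4000000` (64M) also prints b8000000, so both keep PCI allocations out of the [0xb6000000, 0xb8000000) range.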
Comment 90 Bernhard 2009-11-08 08:00:42 UTC
Hi. Sorry for not providing feedback sooner, but I am just moving into another flat. Anyway, the patch provided makes the device work. Below are the dmesg outputs after boot with CONFIG_PCI_DEBUG=y.

dmesg from 2.6.31.5 stable (unfortunately I forgot to set the "debug" kernel option): 
http://pastebin.com/f68658b89

dmesg from patched 2.6.31.5 kernel with "debug" set in grub: 
http://pastebin.com/f739fdf6b

Bernhard
Comment 91 Luis Chamberlain 2009-11-10 20:51:47 UTC
Someone please close this bug and mark as fixed (and closed) -- this fix is now on 2.6.31.6.
Comment 92 Rafael J. Wysocki 2009-11-10 20:55:23 UTC
Closing.