Bug 16084

Summary: iwl3945 bug in 2.6.34
Product: Drivers Reporter: Maciej Rutecki (maciej.rutecki)
Component: network-wirelessAssignee: Reinette Chatre (reinette.chatre)
Status: CLOSED CODE_FIX    
Severity: normal CC: eerpini, linville, maciej.rutecki, reinette.chatre, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 15310    

Description Maciej Rutecki 2010-05-31 17:34:46 UTC
Subject    : iwl3945 bug in 2.6.34
Submitter  : Satish Eerpini <eerpini@gmail.com>
Date       : 2010-05-23 6:37
Message-ID : AANLkTik_7rxDBc0TKlAfoYyM5S6Cf_Hyxbr4W5ORnTsq@mail.gmail.com
References : http://marc.info/?l=linux-kernel&m=127459596015626&w=2

This entry is being used for tracking a regression from 2.6.33.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Reinette Chatre 2010-06-09 17:39:54 UTC
Patches provided in http://article.gmane.org/gmane.linux.kernel/990443 on 5/25 but user has not yet responded with test results. Pinged user again on 6/3 ( http://article.gmane.org/gmane.linux.kernel/994780 ), still no test results.
Comment 2 Satish 2010-06-10 18:52:41 UTC
I have applied the patches to the 2.6.34 kernel, my machine has been
up for 20 hours now, there is nothing in the dmesg either, dmesg |
grep wlan0 just gives this :

ADDRCONF(NETDEV_UP): wlan0: link is not ready
wlan0: deauthenticating from 00:1b:da:2a:a1:53 by local choice (reason=3)
wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1)
wlan0: authenticated
wlan0: associate with 00:1b:da:2a:a1:53 (try 1)
wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1)
wlan0: associated
ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
wlan0: no IPv6 routers present
wlan0: deauthenticating from 00:1b:da:2a:a1:53 by local choice (reason=3)
ADDRCONF(NETDEV_UP): wlan0: link is not ready
wlan0: deauthenticating from 00:1b:da:2a:a1:53 by local choice (reason=3)
wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1)
wlan0: authenticated
wlan0: associate with 00:1b:da:2a:a1:53 (try 1)
wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1)
wlan0: associated
ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
wlan0: no IPv6 routers present
wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1)
wlan0: authenticated
wlan0: associate with 00:1b:da:2a:a1:53 (try 1)
wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1)
wlan0: associated
wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1)
wlan0: authenticated
wlan0: associate with 00:1b:da:2a:a1:53 (try 1)
wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1)
wlan0: associated

so I think after applying the patches, things have been working fine.
Comment 3 Satish 2010-06-11 11:40:05 UTC
The 2.6.34 vanilla kernel with the above patches applied crashed again today, I was amidst a conference on webex when this happened, still not able to figure out what is the kind of traffic that causes the driver to give up ... here is output from dmesg : 


cfg80211: Calling CRDA to update world regulatory domain
cfg80211: Calling CRDA for country: IN
cfg80211: Regulatory domain changed to country: IN
    (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
    (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm)
    (5170000 KHz - 5250000 KHz @ 20000 KHz), (N/A, 2000 mBm)
    (5250000 KHz - 5330000 KHz @ 20000 KHz), (N/A, 2000 mBm)
    (5735000 KHz - 5835000 KHz @ 20000 KHz), (N/A, 2000 mBm)
wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1)
wlan0: authenticated
wlan0: associate with 00:1b:da:2a:a1:53 (try 1)
wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1)
wlan0: associated
usb 4-2: USB disconnect, address 4
iwl3945 0000:10:00.0: queue 2 stuck 3 time. Fw reload.
iwl3945 0000:10:00.0: On demand firmware reload
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
iwl3945 0000:10:00.0: BSM uCode verification failed at addr 0x00003800+0 (of 900), is 0xffffffff, s/b 0xf802020
iwl3945 0000:10:00.0: Wait for START_ALIVE timeout after 2000ms.
No probe response from AP 00:1b:da:2a:a1:53 after 500ms, disconnecting.
cfg80211: Calling CRDA to update world regulatory domain
cfg80211: Calling CRDA for country: IN
cfg80211: Regulatory domain changed to country: IN
    (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
    (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm)
    (5170000 KHz - 5250000 KHz @ 20000 KHz), (N/A, 2000 mBm)
    (5250000 KHz - 5330000 KHz @ 20000 KHz), (N/A, 2000 mBm)
    (5735000 KHz - 5835000 KHz @ 20000 KHz), (N/A, 2000 mBm)

is there anything else that can be done ??
Comment 4 Reinette Chatre 2010-06-11 16:26:20 UTC
(In reply to comment #3)
> The 2.6.34 vanilla kernel with the above patches applied crashed again today,
> I
> was amidst a conference on webex when this happened, still not able to figure
> out what is the kind of traffic that causes the driver to give up ... here is
> output from dmesg : 
> 
> 
> cfg80211: Calling CRDA to update world regulatory domain
> cfg80211: Calling CRDA for country: IN
> cfg80211: Regulatory domain changed to country: IN
>     (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
>     (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm)
>     (5170000 KHz - 5250000 KHz @ 20000 KHz), (N/A, 2000 mBm)
>     (5250000 KHz - 5330000 KHz @ 20000 KHz), (N/A, 2000 mBm)
>     (5735000 KHz - 5835000 KHz @ 20000 KHz), (N/A, 2000 mBm)
> wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1)
> wlan0: authenticated
> wlan0: associate with 00:1b:da:2a:a1:53 (try 1)
> wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1)
> wlan0: associated
> usb 4-2: USB disconnect, address 4
> iwl3945 0000:10:00.0: queue 2 stuck 3 time. Fw reload.
> iwl3945 0000:10:00.0: On demand firmware reload
> iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
> iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF

The patch appears to have done its work - it detected an unresponsive device and triggered a firmware reload in an effort to recover it. Unfortunately the reason why the device stops responding is because it seems to have been disconnected from the PCI bus, that is why we get all ones when trying to read the registers. This does not seem to be a system power saving component that triggers this since you mention that you were busy using the device when this happened. I do not know at this time how to address this, will look into this more.

In the mean time, could you please keep running with these patches and compile with CONFIG_PRINTK_TIME so that we can get an idea of timing related to this issue?
Comment 5 Satish 2010-06-11 18:00:38 UTC
sure will do that and report back when a crash happens again ...
Comment 6 Reinette Chatre 2010-06-11 18:04:23 UTC
(In reply to comment #5)
> sure will do that and report back when a crash happens again ...

Thank you. I think our driver now needs pci error recovery. This will take a while to implement and I'll get back to you for some testing of it.
Comment 7 Satish 2010-06-11 18:18:54 UTC
> Thank you. I think our driver now needs pci error recovery. This will take a
> while to implement and I'll get back to you for some testing of it.

two questions immediately pop up, why does it need pci error recovery "now" ? as in what regression might have led to this bug, was it something within the iwl drivers or somewhere else.

Sure, will be ready for testing changes.
Comment 8 Reinette Chatre 2010-06-11 18:26:33 UTC
(In reply to comment #7)
> > Thank you. I think our driver now needs pci error recovery. This will take
> a
> > while to implement and I'll get back to you for some testing of it.
> 
> two questions immediately pop up, why does it need pci error recovery "now" ?
> as in what regression might have led to this bug, was it something within the
> iwl drivers or somewhere else.
> 
> Sure, will be ready for testing changes.

The device returning 0xffffffff on register reads has been popping up once in a while and was thought to be power saving related and we had some developers looking into it from this angle. Based on your description this is not the case so we should look elsewhere. I agree that this could be a regression. I see two ways in which to look into this:

* bisect - this may be really hard since from what I understand this is not easy for you to reproduce. That is why I did not even raise the option.
* dig deeper via pci error recovery - I read some more about pci error recovery and it seems to be possible to get notified whenever the device gets disconnected. I am not sure how this works or if it will work in this case, but it is surely worthwhile investigating. Once we know _exactly_ when device gets disconnected (instead of learning of it through a stuck queue like now) then we can dig deeper and find out exactly what driver was doing at that time to perhaps see a trigger.
Comment 9 Satish 2010-06-11 19:09:23 UTC
sure will do that ...


Satish

On Fri, Jun 11, 2010 at 9:56 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16084
>
>
>
>
>
> --- Comment #4 from Reinette Chatre <reinette.chatre@intel.com>  2010-06-11
> 16:26:20 ---
> (In reply to comment #3)
>> The 2.6.34 vanilla kernel with the above patches applied crashed again
>> today, I
>> was amidst a conference on webex when this happened, still not able to
>> figure
>> out what is the kind of traffic that causes the driver to give up ... here
>> is
>> output from dmesg :
>>
>>
>> cfg80211: Calling CRDA to update world regulatory domain
>> cfg80211: Calling CRDA for country: IN
>> cfg80211: Regulatory domain changed to country: IN
>>     (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
>>     (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm)
>>     (5170000 KHz - 5250000 KHz @ 20000 KHz), (N/A, 2000 mBm)
>>     (5250000 KHz - 5330000 KHz @ 20000 KHz), (N/A, 2000 mBm)
>>     (5735000 KHz - 5835000 KHz @ 20000 KHz), (N/A, 2000 mBm)
>> wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1)
>> wlan0: authenticated
>> wlan0: associate with 00:1b:da:2a:a1:53 (try 1)
>> wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1)
>> wlan0: associated
>> usb 4-2: USB disconnect, address 4
>> iwl3945 0000:10:00.0: queue 2 stuck 3 time. Fw reload.
>> iwl3945 0000:10:00.0: On demand firmware reload
>> iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
>> iwl3945 0000:10:00.0: MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF
>
> The patch appears to have done its work - it detected an unresponsive device
> and triggered a firmware reload in an effort to recover it. Unfortunately the
> reason why the device stops responding is because it seems to have been
> disconnected from the PCI bus, that is why we get all ones when trying to
> read
> the registers. This does not seem to be a system power saving component that
> triggers this since you mention that you were busy using the device when this
> happened. I do not know at this time how to address this, will look into this
> more.
>
> In the mean time, could you please keep running with these patches and
> compile
> with CONFIG_PRINTK_TIME so that we can get an idea of timing related to this
> issue?
>
> --
> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.
>



-- 
http://satisheerpini.net
Comment 10 Satish 2010-06-11 20:12:10 UTC
> The device returning 0xffffffff on register reads has been popping up once in
> a
> while and was thought to be power saving related and we had some developers
> looking into it from this angle. Based on your description this is not the
> case
> so we should look elsewhere. I agree that this could be a regression. I see
> two
> ways in which to look into this:
> 
> * bisect - this may be really hard since from what I understand this is not
> easy for you to reproduce. That is why I did not even raise the option.
> * dig deeper via pci error recovery - I read some more about pci error
> recovery
> and it seems to be possible to get notified whenever the device gets
> disconnected. I am not sure how this works or if it will work in this case,
> but
> it is surely worthwhile investigating. Once we know _exactly_ when device
> gets
> disconnected (instead of learning of it through a stuck queue like now) then
> we
> can dig deeper and find out exactly what driver was doing at that time to
> perhaps see a trigger.

ok, so is there a way we can force the PCI bus to disconnect to see if the pci error recovery code is working ? writing error_detected, slot_reset and resume should be easy, the reason I am not able to write these is because I don know what all has to be done to like reinitialize the card in slot_reset(), or cancel all PCI I/O in error_detected() and then set the error state, or what all operations need to be done to resume PCI I/O in resume(). Looking at net/s2io.c is helpful I guess, it has pci_error_handlers implemented. 

I have been reading the iwl3945-base.c code, will post a patch if I can implement and test pci error recovery for the card.
Comment 11 Reinette Chatre 2010-06-15 17:38:16 UTC
I inquired a bit more about this and I was told that the mobile chipsets do not support AER (see pcieaer-howto.txt) ... so unfortunately we cannot use these callbacks. I am looking if it is possible to use our "0xffffffff register read" as a crude error detector and then use the reset calls from these drivers.
Comment 12 Reinette Chatre 2010-06-21 21:38:44 UTC
Is the PCI config space still accessible when the device gets into this state?
That is, when you start seeing the "MAC is in deep sleep!.  CSR_GP_CNTRL =
0xFFFFFFFF"? Could you please check the device id with lspci and then run
"lspci -s <device> -xxxx" when this error occurs?
Comment 13 Satish 2010-06-22 21:32:50 UTC
the last time I reported the bug was the last time it occurred, have not been able to reproduce it after that. I have been running the same 2.6.34 kernel with your earlies two patches. Interestingly the latest fedora 13 kernels crash more frequently that what I saw even with the vanilla 2.6.34 kernel. 
Is it ok if I try what you mentioned above on the fedora kernel which is crashing ?
Comment 14 Reinette Chatre 2010-06-22 21:46:36 UTC
Seems that we are tracking two issues here. 

Do you believe that the two patches I sent you earlier are addressing the issue you reported initially? If so, then I need to get them into the hands of stable.

As far as the second issue goes ... I'd really like more information about it, I don't really care which kernel you are using. Is the Fedora kernel crashing with the "MAC is in deep sleep!.  CSR_GP_CNTRL = 0xFFFFFFFF" message? That is what I am looking into now and would like more information on as requested in comment #12. If the Fedora kernel is crashing in some other way then please open a new bug report, which should probably go to the Fedora bugzilla.
Comment 15 Reinette Chatre 2010-06-28 22:51:32 UTC
I think we should keep better track of this. Since the patches are addressing your original issue I am sending it to stable for inclusion into 2.6.34. Since this bug is tracked as a regression it will make things easier to track if we close this bug now and not pollute it with more issues.

Could you please open a new bug report for the second issue when you encounter it again?
Comment 16 Satish 2010-06-28 23:34:50 UTC
sure , will do that. So this is closed now.