Bug 16084
Summary: | iwl3945 bug in 2.6.34 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Maciej Rutecki (maciej.rutecki) |
Component: | network-wireless | Assignee: | Reinette Chatre (reinette.chatre) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | eerpini, linville, maciej.rutecki, reinette.chatre, rjw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.34 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 15310 |
Description
Maciej Rutecki
2010-05-31 17:34:46 UTC
Patches provided in http://article.gmane.org/gmane.linux.kernel/990443 on 5/25 but user has not yet responded with test results. Pinged user again on 6/3 ( http://article.gmane.org/gmane.linux.kernel/994780 ), still no test results. I have applied the patches to the 2.6.34 kernel, my machine has been up for 20 hours now, there is nothing in the dmesg either, dmesg | grep wlan0 just gives this : ADDRCONF(NETDEV_UP): wlan0: link is not ready wlan0: deauthenticating from 00:1b:da:2a:a1:53 by local choice (reason=3) wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1) wlan0: authenticated wlan0: associate with 00:1b:da:2a:a1:53 (try 1) wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1) wlan0: associated ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready wlan0: no IPv6 routers present wlan0: deauthenticating from 00:1b:da:2a:a1:53 by local choice (reason=3) ADDRCONF(NETDEV_UP): wlan0: link is not ready wlan0: deauthenticating from 00:1b:da:2a:a1:53 by local choice (reason=3) wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1) wlan0: authenticated wlan0: associate with 00:1b:da:2a:a1:53 (try 1) wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1) wlan0: associated ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready wlan0: no IPv6 routers present wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1) wlan0: authenticated wlan0: associate with 00:1b:da:2a:a1:53 (try 1) wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1) wlan0: associated wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1) wlan0: authenticated wlan0: associate with 00:1b:da:2a:a1:53 (try 1) wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1) wlan0: associated so I think after applying the patches, things have been working fine. The 2.6.34 vanilla kernel with the above patches applied crashed again today, I was amidst a conference on webex when this happened, still not able to figure out what is the kind of traffic that causes the driver to give up ... here is output from dmesg : cfg80211: Calling CRDA to update world regulatory domain cfg80211: Calling CRDA for country: IN cfg80211: Regulatory domain changed to country: IN (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp) (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm) (5170000 KHz - 5250000 KHz @ 20000 KHz), (N/A, 2000 mBm) (5250000 KHz - 5330000 KHz @ 20000 KHz), (N/A, 2000 mBm) (5735000 KHz - 5835000 KHz @ 20000 KHz), (N/A, 2000 mBm) wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1) wlan0: authenticated wlan0: associate with 00:1b:da:2a:a1:53 (try 1) wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1) wlan0: associated usb 4-2: USB disconnect, address 4 iwl3945 0000:10:00.0: queue 2 stuck 3 time. Fw reload. iwl3945 0000:10:00.0: On demand firmware reload iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF iwl3945 0000:10:00.0: BSM uCode verification failed at addr 0x00003800+0 (of 900), is 0xffffffff, s/b 0xf802020 iwl3945 0000:10:00.0: Wait for START_ALIVE timeout after 2000ms. No probe response from AP 00:1b:da:2a:a1:53 after 500ms, disconnecting. cfg80211: Calling CRDA to update world regulatory domain cfg80211: Calling CRDA for country: IN cfg80211: Regulatory domain changed to country: IN (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp) (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm) (5170000 KHz - 5250000 KHz @ 20000 KHz), (N/A, 2000 mBm) (5250000 KHz - 5330000 KHz @ 20000 KHz), (N/A, 2000 mBm) (5735000 KHz - 5835000 KHz @ 20000 KHz), (N/A, 2000 mBm) is there anything else that can be done ?? (In reply to comment #3) > The 2.6.34 vanilla kernel with the above patches applied crashed again today, > I > was amidst a conference on webex when this happened, still not able to figure > out what is the kind of traffic that causes the driver to give up ... here is > output from dmesg : > > > cfg80211: Calling CRDA to update world regulatory domain > cfg80211: Calling CRDA for country: IN > cfg80211: Regulatory domain changed to country: IN > (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp) > (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm) > (5170000 KHz - 5250000 KHz @ 20000 KHz), (N/A, 2000 mBm) > (5250000 KHz - 5330000 KHz @ 20000 KHz), (N/A, 2000 mBm) > (5735000 KHz - 5835000 KHz @ 20000 KHz), (N/A, 2000 mBm) > wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1) > wlan0: authenticated > wlan0: associate with 00:1b:da:2a:a1:53 (try 1) > wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1) > wlan0: associated > usb 4-2: USB disconnect, address 4 > iwl3945 0000:10:00.0: queue 2 stuck 3 time. Fw reload. > iwl3945 0000:10:00.0: On demand firmware reload > iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF > iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF The patch appears to have done its work - it detected an unresponsive device and triggered a firmware reload in an effort to recover it. Unfortunately the reason why the device stops responding is because it seems to have been disconnected from the PCI bus, that is why we get all ones when trying to read the registers. This does not seem to be a system power saving component that triggers this since you mention that you were busy using the device when this happened. I do not know at this time how to address this, will look into this more. In the mean time, could you please keep running with these patches and compile with CONFIG_PRINTK_TIME so that we can get an idea of timing related to this issue? sure will do that and report back when a crash happens again ... (In reply to comment #5) > sure will do that and report back when a crash happens again ... Thank you. I think our driver now needs pci error recovery. This will take a while to implement and I'll get back to you for some testing of it.
> Thank you. I think our driver now needs pci error recovery. This will take a
> while to implement and I'll get back to you for some testing of it.
two questions immediately pop up, why does it need pci error recovery "now" ? as in what regression might have led to this bug, was it something within the iwl drivers or somewhere else.
Sure, will be ready for testing changes.
(In reply to comment #7) > > Thank you. I think our driver now needs pci error recovery. This will take > a > > while to implement and I'll get back to you for some testing of it. > > two questions immediately pop up, why does it need pci error recovery "now" ? > as in what regression might have led to this bug, was it something within the > iwl drivers or somewhere else. > > Sure, will be ready for testing changes. The device returning 0xffffffff on register reads has been popping up once in a while and was thought to be power saving related and we had some developers looking into it from this angle. Based on your description this is not the case so we should look elsewhere. I agree that this could be a regression. I see two ways in which to look into this: * bisect - this may be really hard since from what I understand this is not easy for you to reproduce. That is why I did not even raise the option. * dig deeper via pci error recovery - I read some more about pci error recovery and it seems to be possible to get notified whenever the device gets disconnected. I am not sure how this works or if it will work in this case, but it is surely worthwhile investigating. Once we know _exactly_ when device gets disconnected (instead of learning of it through a stuck queue like now) then we can dig deeper and find out exactly what driver was doing at that time to perhaps see a trigger. sure will do that ... Satish On Fri, Jun 11, 2010 at 9:56 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=16084 > > > > > > --- Comment #4 from Reinette Chatre <reinette.chatre@intel.com> 2010-06-11 > 16:26:20 --- > (In reply to comment #3) >> The 2.6.34 vanilla kernel with the above patches applied crashed again >> today, I >> was amidst a conference on webex when this happened, still not able to >> figure >> out what is the kind of traffic that causes the driver to give up ... here >> is >> output from dmesg : >> >> >> cfg80211: Calling CRDA to update world regulatory domain >> cfg80211: Calling CRDA for country: IN >> cfg80211: Regulatory domain changed to country: IN >> (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp) >> (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm) >> (5170000 KHz - 5250000 KHz @ 20000 KHz), (N/A, 2000 mBm) >> (5250000 KHz - 5330000 KHz @ 20000 KHz), (N/A, 2000 mBm) >> (5735000 KHz - 5835000 KHz @ 20000 KHz), (N/A, 2000 mBm) >> wlan0: authenticate with 00:1b:da:2a:a1:53 (try 1) >> wlan0: authenticated >> wlan0: associate with 00:1b:da:2a:a1:53 (try 1) >> wlan0: RX AssocResp from 00:1b:da:2a:a1:53 (capab=0x411 status=0 aid=1) >> wlan0: associated >> usb 4-2: USB disconnect, address 4 >> iwl3945 0000:10:00.0: queue 2 stuck 3 time. Fw reload. >> iwl3945 0000:10:00.0: On demand firmware reload >> iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF >> iwl3945 0000:10:00.0: MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF > > The patch appears to have done its work - it detected an unresponsive device > and triggered a firmware reload in an effort to recover it. Unfortunately the > reason why the device stops responding is because it seems to have been > disconnected from the PCI bus, that is why we get all ones when trying to > read > the registers. This does not seem to be a system power saving component that > triggers this since you mention that you were busy using the device when this > happened. I do not know at this time how to address this, will look into this > more. > > In the mean time, could you please keep running with these patches and > compile > with CONFIG_PRINTK_TIME so that we can get an idea of timing related to this > issue? > > -- > Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug. > -- http://satisheerpini.net
> The device returning 0xffffffff on register reads has been popping up once in
> a
> while and was thought to be power saving related and we had some developers
> looking into it from this angle. Based on your description this is not the
> case
> so we should look elsewhere. I agree that this could be a regression. I see
> two
> ways in which to look into this:
>
> * bisect - this may be really hard since from what I understand this is not
> easy for you to reproduce. That is why I did not even raise the option.
> * dig deeper via pci error recovery - I read some more about pci error
> recovery
> and it seems to be possible to get notified whenever the device gets
> disconnected. I am not sure how this works or if it will work in this case,
> but
> it is surely worthwhile investigating. Once we know _exactly_ when device
> gets
> disconnected (instead of learning of it through a stuck queue like now) then
> we
> can dig deeper and find out exactly what driver was doing at that time to
> perhaps see a trigger.
ok, so is there a way we can force the PCI bus to disconnect to see if the pci error recovery code is working ? writing error_detected, slot_reset and resume should be easy, the reason I am not able to write these is because I don know what all has to be done to like reinitialize the card in slot_reset(), or cancel all PCI I/O in error_detected() and then set the error state, or what all operations need to be done to resume PCI I/O in resume(). Looking at net/s2io.c is helpful I guess, it has pci_error_handlers implemented.
I have been reading the iwl3945-base.c code, will post a patch if I can implement and test pci error recovery for the card.
I inquired a bit more about this and I was told that the mobile chipsets do not support AER (see pcieaer-howto.txt) ... so unfortunately we cannot use these callbacks. I am looking if it is possible to use our "0xffffffff register read" as a crude error detector and then use the reset calls from these drivers. Is the PCI config space still accessible when the device gets into this state? That is, when you start seeing the "MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF"? Could you please check the device id with lspci and then run "lspci -s <device> -xxxx" when this error occurs? the last time I reported the bug was the last time it occurred, have not been able to reproduce it after that. I have been running the same 2.6.34 kernel with your earlies two patches. Interestingly the latest fedora 13 kernels crash more frequently that what I saw even with the vanilla 2.6.34 kernel. Is it ok if I try what you mentioned above on the fedora kernel which is crashing ? Seems that we are tracking two issues here. Do you believe that the two patches I sent you earlier are addressing the issue you reported initially? If so, then I need to get them into the hands of stable. As far as the second issue goes ... I'd really like more information about it, I don't really care which kernel you are using. Is the Fedora kernel crashing with the "MAC is in deep sleep!. CSR_GP_CNTRL = 0xFFFFFFFF" message? That is what I am looking into now and would like more information on as requested in comment #12. If the Fedora kernel is crashing in some other way then please open a new bug report, which should probably go to the Fedora bugzilla. I think we should keep better track of this. Since the patches are addressing your original issue I am sending it to stable for inclusion into 2.6.34. Since this bug is tracked as a regression it will make things easier to track if we close this bug now and not pollute it with more issues. Could you please open a new bug report for the second issue when you encounter it again? sure , will do that. So this is closed now. |