Bug 15419 - ipw2200: requesting scan causes connection drops
Summary: ipw2200: requesting scan causes connection drops
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-01 17:26 UTC by Maurizio Avogadro
Modified: 2010-08-18 19:21 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.32.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Commented dmesg (ipw2200 debug=255) (69.03 KB, text/plain)
2010-03-01 17:26 UTC, Maurizio Avogadro
Details
2 more logs (19.63 KB, application/x-gzip)
2010-03-10 14:40 UTC, Maurizio Avogadro
Details
logs [debug=0x800 + IPW_SCAN_CHECK_WATCHDOG (10 * HZ)] (5.94 KB, application/x-gzip)
2010-03-12 09:33 UTC, Maurizio Avogadro
Details
a patch to try (1.13 KB, patch)
2010-03-16 02:42 UTC, Zhu Yi
Details | Diff

Description Maurizio Avogadro 2010-03-01 17:26:30 UTC
Created attachment 25292 [details]
Commented dmesg (ipw2200 debug=255)

Overview:
Since long time (at least 2.6.24), with some networks the connection gets systematically dropped when a scan is requested (i.e. by NetworkManager).

Steps to reproduce:
I could always reproduce this issue by stopping NetworkManager and manually configuring a wireless connection to an affected network in /etc/network/interfaces (ifupdown uses wpa_supplicant, like NM does): when the connection is active, a iwlist ethx scan command always causes ap disassociation and subsequent connection drop. If no scans are requested the connection is very reliable and fast and lasts hours (no disassociations happen).

Actual results:
The card gets disassociated from the ap and the connection drops.

Expected results:
The connection shouldn't drop when scanning is requested.

Build date and platform:
Debian Sid i386 current (kernel 2.6.32-9) on an Acer TravelMate 435LMi + Intel ProWireless 2915 (ipw2200).

Additional info:
- wireless device lspci -vvv:
02:02.0 Network controller: Intel Corporation PRO/Wireless 2915ABG [Calexico2] Network Connection (rev 05)
        Subsystem: Intel Corporation Device 1001
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 32 (750ns min, 6000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at d0200000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Kernel driver in use: ipw2200

- note:
The logfile I'm attaching refers to a connection to a Linksys WAG325N: the ap was 3mt far from the laptop, no matter whether the 802.11n features were enabled or not, whether the ESSID was broadcasted or not and regardless of used encryption (RSN or WPA).
The ipw2200 module was loaded with no additional parameters.

Thanks

Maurizio
Comment 1 John W. Linville 2010-03-01 17:59:50 UTC
This is likely caused by the scan spending too much time away from the operational channel.  It looks like the scanning is done in the firmware on the device, so I'm not sure how much can be done.  Perhaps the dwell time can be tweaked a bit.  Does the behaviour change if you using "iwlist eth1 scan essid blah" (where blah is a specific ssid in your area)?
Comment 2 Zhu Yi 2010-03-02 02:08:13 UTC
> ipw2200: U ipw_wx_get_essid Getting essid: 'sayonara'
> ipw2200: U ipw_wx_get_wap Getting WAP BSSID: 00:1a:70:a2:d7:94
> ipw2200: U ipw_wx_get_nick Getting nick
> ipw2200: U ipw_wx_get_rate GET Rate -> 54000000 
> ipw2200: U ipw_wx_get_mode Get MODE -> 2
> ipw2200: U ipw_deinit Aborting scan during shutdown.
> ipw2200: U ipw_deinit Disassociating during shutdown.
> ipw2200: U ipw_send_disassociate Disassocation attempt from 00:1a:70:a2:d7:94
> on channel 13.

Looks like the disassociation is caused by shutdown. I didn't find where it is triggered by scan. BTW, is it a regression?
Comment 3 Maurizio Avogadro 2010-03-02 09:59:22 UTC
(In reply to comment #1)
> This is likely caused by the scan spending too much time away from the
> operational channel.  It looks like the scanning is done in the firmware on
> the
> device, so I'm not sure how much can be done.  Perhaps the dwell time can be
> tweaked a bit.  Does the behaviour change if you using "iwlist eth1 scan
> essid
> blah" (where blah is a specific ssid in your area)?

Unfortunately it seems that specifying an existing essid doesn't change things.
Comment 4 Maurizio Avogadro 2010-03-02 10:28:13 UTC
(In reply to comment #2)
> Looks like the disassociation is caused by shutdown. I didn't find where it
> is
> triggered by scan. BTW, is it a regression?

As I wrote I'm experiencing frequent disconnects with this card at least since kernel 2.6.24 but I could find a possible cause only recently. Furthermore, Network Manager might have implemented roaming support at that time, thus making evident an already existing issue.

Please also notice that this issue doesn't show up with any network/environment: I can reliably and steadily connect to some other APs by using Network Manager (i.e. a Linksys WRT-54GL AP running DD-WRT v24, RSN/PSK hidden ESSID); the AP used in the example works flawlessly with other hardware (tested with some Atheros and Broadcom based laptop cards and with some wireless capable Nokia phones).
Comment 5 Maurizio Avogadro 2010-03-04 18:15:55 UTC
Excluding 802.11a with 'iwpriv eth1 set_mode 6' improves connection reliability with NetworkManager, at least with cited WAG325N ap: as far as I can see disconnects are far less frequent, but still happen from time to time.

This seems to confirm John's hypothesis, but doesn't explain why this happens only with some aps.

I noticed that iwspy often shows the same 'Scan request completed' message many times after a single 'iwlist eth1 scan' command: is this behaviour expected?

[ifup eth1]
18:57:51.668521   eth1     Scan request completed
18:57:51.669416   eth1     Set Mode:Managed
18:57:51.669465   eth1     Set Frequency:2.472 GHz (Channel 13)
18:57:51.689680   eth1     New Access Point/Cell address:00:1A:70:A2:D7:94

[iwlist eth1 scan]
18:58:28.053119   eth1     Scan request completed
18:58:33.427150   eth1     Scan request completed

[iwlist eth1 scan]
18:59:12.212311   eth1     Scan request completed
18:59:17.615095   eth1     Scan request completed
18:59:23.618718   eth1     Scan request completed

Is there something more I can do to help investigating this issue?
Comment 6 Zhu Yi 2010-03-08 01:24:08 UTC
NM can initiate scan as well. Do you have a dmesg with the disassociate reason?
Comment 7 Maurizio Avogadro 2010-03-08 02:32:19 UTC
(In reply to comment #6)
> NM can initiate scan as well. Do you have a dmesg with the disassociate
> reason?

Unfortunately, NM does request a background scan every 2 min. max (pls. see https://bugzilla.gnome.org/show_bug.cgi?id=513820). In our case, this means a connection drop every 2 minutes.

Please tell me how to get a more detailed dmesg.
Comment 8 Zhu Yi 2010-03-08 03:19:32 UTC
(In reply to comment #7)
> Please tell me how to get a more detailed dmesg.

Just capture the dmesg into a file _while_ the STA disassociated (due to scan). The one you provided looks like it STA is disassociated due to network shutdown.
Comment 9 Maurizio Avogadro 2010-03-08 23:52:41 UTC
(In reply to comment #8)
> The one you provided looks like it STA is disassociated due to network
> shutdown.

_No_ network shutdowns happened during the dmesg capture I attached (you mean the AP, which I manage by myself, went down, right? Well, when I captured that log at least another client was connected to the AP, and its connection didn't drop). The only commands I sent were the ones you can find in the attached log, and no parts of the dmesg have been omitted.

In the very same conditions, by using ifupdown instead of NetworkManager, the connection with that AP stays up many hours, until I shut my laptop down, without a single drop (since I found the trick, everytime I need to connect to that AP I disable NM and manually configure the connection with ifupdown; as soon as I try to send an 'iwlist eth1 scan' command the connection drops and after a few seconds is automatically reestablished).

Since now, all the logs I could obtain were very similar to the one I attached; I'll get some more logs, in bg only mode too, and I'll post them along with the command which produced them.
Comment 10 Zhu Yi 2010-03-09 01:13:40 UTC
From comment #2, ipw_down is called. That means someone (ie. NM) is closing the interface. To avoid NM's misleading of this problem, can you disable it and attach the dmesg with only iwconfig and iwlist scan?
Comment 11 Maurizio Avogadro 2010-03-10 14:39:14 UTC
Adding some more logs - with _no_ encryption the scan requests didn't cause any connection drop, therefore I didn't attach any log about such configuration.

In the attached tgz you'll find:

- the ipw2200_log.sh script, whose output has been redirected to the
  dmesg_x.log files;
- the test.conf used with wpa_supplicant
- 2 dirs containing both dmesg and wpa_supplicant -dd outputs
Comment 12 Maurizio Avogadro 2010-03-10 14:40:41 UTC
Created attachment 25449 [details]
2 more logs
Comment 13 Zhu Yi 2010-03-11 01:55:30 UTC
I see your problem. The scan has been reset because the driver *thinks* it takes too long (5 seconds). You should be able to see the log message with debug=0x800 (IPW_DL_SCAN). Please try to increase IPW_SCAN_CHECK_WATCHDOG defined value in ipw2200.c and see if the problem is moderated.
Comment 14 Maurizio Avogadro 2010-03-12 09:33:59 UTC
Created attachment 25481 [details]
logs [debug=0x800 + IPW_SCAN_CHECK_WATCHDOG (10 * HZ)]

By looking at the logs, it seems that the scan remains sometimes stuck on a channel: increasing IPW_SCAN_CHECK_WATCHDOG value didn't help.

I found an old patch which seems to address a similar - if not the same - issue: http://article.gmane.org/gmane.linux.drivers.ipw2100.devel/8343 (didn't test it yet).
Comment 15 Maurizio Avogadro 2010-03-12 14:31:47 UTC
oops... I see, that patch has been already applied.

Applying the discarded part (aborting scan instead of restarting the adapter) allows me at least not to have my connection dropped every 2 mins when using NM, but the scans always stop after the first 4 channels.
Comment 16 Zhu Yi 2010-03-15 01:47:00 UTC
What's your AP's beacon interval?
Comment 17 Maurizio Avogadro 2010-03-15 09:21:52 UTC
(In reply to comment #16)
> What's your AP's beacon interval?

100 msec (default setting)
Comment 18 Zhu Yi 2010-03-16 02:42:26 UTC
Created attachment 25543 [details]
a patch to try

Please try if this patch. It doesn't restart the adapter unless abort_scan doesn't work.

BTW, from your log, the firmware doesn't _always_ stuck at channel 52. Maybe change the passive channel dwell in ipw_passive_dwell_time() from (priv->assoc_network->beacon_interval - 10) to (priv->assoc_network->beacon_interval - 25) help?
Comment 19 Maurizio Avogadro 2010-03-19 16:46:13 UTC
(In reply to comment #18)
> BTW, from your log, the firmware doesn't _always_ stuck at channel 52. Maybe
> change the passive channel dwell in ipw_passive_dwell_time() from
> (priv->assoc_network->beacon_interval - 10) to
> (priv->assoc_network->beacon_interval - 25) help?

No, unfortunately that didn't help.

> Created an attachment (id=25543) [details]
> a patch to try
> 
> Please try if this patch. It doesn't restart the adapter unless abort_scan
> doesn't work.

The patch works: the connection doesn't drop anymore upon scan requests. I'm waiting to verify its behaviour when SCAN_ABORT command fails (it happened only once when I was testing the first part of Helmut Schaa's first patch, aborting the scan instead of resetting the adapter: if I'm remembering correctly, dmesg reported 'Failed to send SCAN_ABORT: Command timed out', the adapter was completely locked until I forced a reload of the driver).
Comment 20 Maurizio Avogadro 2010-03-29 14:58:44 UTC
(In reply to comment #19)
> [...] I'm
> waiting to verify its behaviour when SCAN_ABORT command fails (it happened
> only
> once when I was testing the first part of Helmut Schaa's first patch,
> aborting
> the scan instead of resetting the adapter: if I'm remembering correctly,
> dmesg
> reported 'Failed to send SCAN_ABORT: Command timed out', the adapter was
> completely locked until I forced a reload of the driver).

The behaviour seems correct:

[...]
Mar 29 16:46:42 tm430 kernel: ipw2200: Failed to send SCAN_ABORT: Command timed out.
Mar 29 16:46:44 tm430 kernel: ipw2200: Failed to send ASSOCIATE: Command timed out.
Mar 29 16:46:45 tm430 kernel: ipw2200: Failed to send CARD_DISABLE: Command timed out.
Mar 29 16:46:47 tm430 wpa_supplicant[2555]: Associated with xx:xx:xx:xx:xx:xx
[...]

The connection is then reestablished and the device works again. This behaviour is not optimal but definitely better than dropping the connection everytime the scan stalls.
Comment 21 John W. Linville 2010-08-18 19:21:15 UTC
commit e65054b64ff6b89380a7f546c9eebf862e679646
Author: Zhu Yi <yi.zhu@intel.com>
Date:   Tue Mar 30 09:36:16 2010 +0800

    ipw2200: restart adapter only when abort_scan doesn't work
    
    When a scan watchdog is fired, try to send abort scan command
    first before restarting the adapter. This avoids reconnection for
    some users when scan hang is detected.
    
    This fixed bug https://bugzilla.kernel.org/show_bug.cgi?id=15419
    
    Reported-by: Maurizio Avogadro <mavoga@gmail.com>
    Tested-by: Maurizio Avogadro <mavoga@gmail.com>
    Signed-off-by: Zhu Yi <yi.zhu@intel.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

Note You need to log in before you can comment on or make changes to this bug.