Bug 24402

Summary: rtl8187 stop searching for network after suspend
Product: Drivers Reporter: David Heidelberg (david)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: RESOLVED CODE_FIX    
Severity: normal CC: david, herton, htl10, johannes, Larry.Finger, linville
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 2.6.37-rc4-git Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg
Proposed patch for driver rtl8187
part_of_new_dmesg
Revised patch for rtl8187
dmesg_new
Third version of patch for rtl8187
4th_dmesg
Fourth patch version
5th_dmesg
Avoid calling ieee80211_work_work when not needed

Description David Heidelberg 2010-12-06 13:07:36 UTC
Created attachment 39122 [details]
dmesg

Laptop Toshiba A210-15j, 64bit gentoo.

Suspend done by "hibernate-ram"

Dmesg contain probably related warning.

Bus 001 Device 002: ID 0bda:8197 Realtek Semiconductor Corp. RTL8187B Wireless Adapter
Comment 1 Hin-Tak Leung 2010-12-07 00:30:38 UTC
Does 
ifconfig down, modprobe -r, modprobe, ifconfig up,
or 
ifconfig down/up
or
modprobe -r/modprobe
?

reset the device to a usable state? and or modprobe -r/modprobe mac80211 ?

The problem is not *after suspend*, but *before* suspend:

[ 1914.662031] WARNING: at net/mac80211/work.c:884 ieee80211_work_work+0x363/0x1240()
[ 1914.662033] Hardware name: Satellite A210
[ 1914.662035] work scheduled while going to suspend
[ 1914.662038] Pid: 3971, comm: kworker/u:1 Not tainted 2.6.37-rc4-00153-g59e57c6-dirty #5
[ 1914.662041] Call Trace:
[ 1914.662046]  [<ffffffff815ba9a3>] ? ieee80211_work_work+0x363/0x1240
...
[ 1914.662108]  [<ffffffff815ada16>] ? ieee80211_unregister_hw+0x46/0x100
[ 1914.662112]  [<ffffffff815e0fa7>] ? rtl8187_disconnect+0x30/0x6f
[ 1914.662117]  [<ffffffff813dd956>] ? usb_unbind_interface+0x56/0x150
...
Comment 2 Larry Finger 2010-12-07 01:49:02 UTC
Created attachment 39212 [details]
Proposed patch for driver rtl8187

Good catch Hin-Tak.

Please try this patch. I cannot test as my laptop does not suspend.
Comment 3 David Heidelberg 2010-12-07 20:19:01 UTC
Patch tested, but it doesn't helped. Behaviour same, including warning.
Comment 4 Hin-Tak Leung 2010-12-08 02:51:24 UTC
(In reply to comment #3)
> Patch tested, but it doesn't helped. Behaviour same, including warning.

A new dmesg? The oops should be at a different place.

Also the problem *before* suspend seems to be related to an active scan - does it helps if you rfkill the device (i.e. slide the radio switch underneath the front of the keyboard) or ifconfig down it before suspend?
Comment 5 Larry Finger 2010-12-08 03:03:33 UTC
Where is the new dmesg?
Comment 6 David Heidelberg 2010-12-08 10:21:44 UTC
when wifi rfkill is off while suspending, it work
Comment 7 David Heidelberg 2010-12-08 10:29:35 UTC
Created attachment 39232 [details]
part_of_new_dmesg
Comment 8 Larry Finger 2010-12-08 19:24:05 UTC
Created attachment 39292 [details]
Revised patch for rtl8187

Please try this patch. It appears that the work queue must be stopped.
Comment 9 David Heidelberg 2010-12-08 20:11:06 UTC
Created attachment 39322 [details]
dmesg_new

After applying patch, still failing
Comment 10 Larry Finger 2010-12-08 21:12:01 UTC
Created attachment 39352 [details]
Third version of patch for rtl8187

This patch flushes the work queue.
Comment 11 David Heidelberg 2010-12-09 15:26:25 UTC
Created attachment 39612 [details]
4th_dmesg

I'm afraid it didn't helped
Comment 12 Herton Ronaldo Krzesinski 2010-12-11 03:43:15 UTC
I noticed something on the dmesg attached:
ieee80211 phy0: hwaddr 00:c0:a8:ff:54:49, RTL8187BvE V0 + rtl8225z2, rfkill mask 2

In fact it is a B device, so why the patches don't work, because of the "if (!priv->is_rtl8187b)" check. It's strange the warning is happening now, I guess some code changed in mac80211 and may be triggering this, I don't know. I'll try to get access to an 8187b again too and try to reproduce it.

okias, can you comment the if check that the last patch adds in rtl8187_dev.c and test again? The code on rtl8187_disconnect function should look like this after commenting:

...
	priv = dev->priv;
	/*if (!priv->is_rtl8187b) {*/
		ieee80211_stop_queues(dev);
		flush_delayed_work(&priv->work);
	/*}*/
...
Comment 13 Larry Finger 2010-12-11 04:26:56 UTC
Created attachment 39762 [details]
Fourth patch version

The previous suggestion will certainly crash as the workqueue is not initialized for an RTL8187B.

We do, however, need to stop the ieee80211 queues for both kinds of devices. Try the newest version of the patch.
Comment 14 David Heidelberg 2010-12-11 16:48:18 UTC
Created attachment 39852 [details]
5th_dmesg

Failed again
Comment 15 Herton Ronaldo Krzesinski 2010-12-11 17:09:00 UTC
(In reply to comment #13)
> The previous suggestion will certainly crash as the workqueue is not
> initialized for an RTL8187B.

Indeed, I forgot that priv->work is only used with rtl8187l

But what bothers me is why this is happening now, we don't have any relevant changes in 2.6.36..2.6.37-rc4-git in rtl8187. Also this last patch also failed too.

Looking at mac80211 code and rtl8187 code again, I think we have another problem, not in rtl8187. I don't see how rtl8187 is queuing work in this case which triggers the warning, I think it comes from somewhere else:

__ieee80211_suspend sets local->suspended = true, but before that, it already calls ieee80211_stop_queues (in fact it calls ieee80211_stop_queues_by_reason(hw, IEEE80211_QUEUE_STOP_REASON_SUSPEND), which is the same as we calling ieee80211_stop_queues from the driver, only reason is different).

Also before setting "local->suspended = true", __ieee80211_suspend also calls the stop callback in the driver:

...
        if (local->open_count)
                ieee80211_stop_device(local);
...

In our stop callback, we cancel all delayed work in priv->work of rtl8187.

Besides this, all work we schedule is the led work, but in rtl8187_disconnect, we call rtl8187_leds_exit before ieee80211_unregister_hw (which triggers the warning).

Also, the last led command should come before "local->suspended = true", when led is set to off in ieee80211_stop_device, and workqueue is flushed there, so shouldn't cause issues. Anyway, if we were scheduling work after "local->suspended = true" in mac80211, we would be triggering the warning before when calling ieee80211_queue_delayed_work.
Comment 16 Larry Finger 2010-12-11 21:51:29 UTC
Herton makes some really good points, which makes me wonder if the bug really is in rtl8187.

You did not list this as a regression. Have you tested the standard kernel for your distribution?
Comment 17 David Heidelberg 2010-12-11 21:57:40 UTC
I guess it worked in 2.6.35 or 2.6.36, but not sure. I updating kernel from git, so I don't know for sure. I could try bisect between some work revision, but it can takes lot of hour rebooting and testing.
Comment 18 Herton Ronaldo Krzesinski 2010-12-11 22:29:05 UTC
I think this problem should happen since linux 2.6.34, after the commits:

commit af6b637 "mac80211: generalise work handling" +
commit b8bc4b0 "mac80211: support remain-on-channel command"

The first one added the ieee80211_work_purge function calling it on ieee80211_stop, and the second is what would appear in a bisect, it's the one which should be causing the warning, because it adds the ieee80211_work_work on ieee80211_work_purge (ieee80211_work_work checks the local->suspended and does the warning).

Because rtl8187 doesn't have suspend/resume hooks, usb core on suspend removes rtl8187 after mac80211 is suspended, calling rtl8187_disconnect. But on rtl8187_disconnect, we call ieee80211_unregister_hw, which calls dev_close, which calls ieee80211_stop, and in the end calls ieee80211_work_purge->ieee80211_work_work)

In my mind there are two fixes we can do:
- We add suspend/resume hooks on rtl8187, or
- Do something on mac80211 to avoid situation above, to support drivers without suspend/resume hooks.
Comment 19 Herton Ronaldo Krzesinski 2010-12-12 02:42:14 UTC
Created attachment 39892 [details]
Avoid calling ieee80211_work_work when not needed

okias, please check this patch.

For example, zd1211rw is another case which could have this problem too, as rtl8187 it doesn't have suspend/resume hooks. So seems sensible to make a fix in mac80211.
Comment 20 David Heidelberg 2010-12-12 13:23:16 UTC
Base problem fixed by last patch. After suspend NM wasn't able to work propertly, only after RFkill wifi and reload, it start work. But it's success :-)
Comment 21 Herton Ronaldo Krzesinski 2010-12-13 13:08:09 UTC
Ok, so really the warning was bogus at least (no work in work_list, as I would expect after mac80211 is suspended). And when there is really work queued on suspend, the warning will trigger.

About you need to rfkill after suspend, may be network-manager is confused about removal of device by usb stack.

Johannes, can you check if the patch on comment #19 for mac80211 is ok? If it is, I'll submit to linux-wireless.
Comment 22 Johannes Berg 2010-12-13 13:14:51 UTC
Yes, that seems reasonable. I guess the only change is actually killing the warning in this case.
Comment 23 David Heidelberg 2010-12-13 14:42:45 UTC
No, with that patch I'm able use wifi even after suspend. Without patch it doesn't work at all.
Comment 24 Hin-Tak Leung 2010-12-13 15:07:33 UTC
(In reply to comment #23)
> No, with that patch I'm able use wifi even after suspend. Without patch it
> doesn't work at all.

There is still the mis-understanding - the problem is not 'after suspend' (waking up), but "before suspend" (going to sleep).- the driver does not go to sleep properly, and therefore does not work later.

As for network manager - it usually takes say, 30 seconds between waking up to NM re-establishing association so you may need to wait a little; but either way, there should be some log entries in syslog (/var/log/message or thereabouts) about NetworkManager detecting device state changes, etc. "wasn't able to work propertly" is just a bit vague. - the result of 'grep NetworkManager /var/log/messages' or 'grep -10 NetworkManager /var/log/messages' (include 10 lines before and after any NetworkManager entries) should be informative (but is really outside the scope of kernel bugzilla, really).
Comment 25 Herton Ronaldo Krzesinski 2010-12-13 16:29:02 UTC
Well, the patch really only removes the wrong warning, I don't expect it change behaviour (well, only difference is kernel is not 'warning tainted' anymore), as without it ieee80211_work_work runs but exit early anyway after the WARN, so even without the patch I would expect it not breaking anything.

If even with the patch the warning was still being printed, then we would have something broken, but was not the case, just the warning was bogus.