Bug 43044 - Kernel oops when connecting to wlan 802.11g network
Status: NEW
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 normal
Assigned To: drivers_network-wireless@kernel-bugs.osdl.org
Depends on:
Blocks:
 
Reported: 2012-04-04 19:40 UTC by Juho-Mikko Pellinen
Modified: 2016-03-07 15:07 UTC (History)

See Also:
Kernel Version: vanilla 3.4-rc1
Tree: Mainline
Regression: Yes


Attachments
Kernel oops message when connecting to 802.11g network. (268.61 KB, image/jpeg)
2012-04-04 19:51 UTC, Juho-Mikko Pellinen
1st boot when the network card is probably attaching to wlan network. (336.08 KB, image/jpeg)
2012-04-04 19:54 UTC, Juho-Mikko Pellinen
2nd boot similar to the first one (317.02 KB, image/jpeg)
2012-04-04 19:56 UTC, Juho-Mikko Pellinen
3rd boot similar to the first and second (338.00 KB, image/jpeg)
2012-04-04 19:58 UTC, Juho-Mikko Pellinen
Trial patch to fix oops (535 bytes, patch)
2012-04-05 05:03 UTC, Larry Finger
dmesg containing the stacktraces (212.68 KB, text/plain)
2012-04-06 20:32 UTC, Juho-Mikko Pellinen
Kernel config. (84.28 KB, application/octet-stream)
2012-04-06 20:34 UTC, Juho-Mikko Pellinen
dmesg containing some 802.11g stacktraces and then the 802.11n messages. (642.25 KB, application/octet-stream)
2012-04-06 21:09 UTC, Juho-Mikko Pellinen
In this dmesg part I configured the AP to 802.11n, 40Mhz, Control Sideband: Upper (27.81 KB, application/octet-stream)
2012-04-06 21:14 UTC, Juho-Mikko Pellinen
Connection to 802.11g with stacktraces (273.36 KB, application/octet-stream)
2012-04-07 17:05 UTC, Juho-Mikko Pellinen
Patch to limit WARN statements from mac80211 (763 bytes, patch)
2012-04-07 18:18 UTC, Larry Finger
The number of stacktraces went down after the patch. (17.92 KB, application/octet-stream)
2012-04-07 20:38 UTC, Juho-Mikko Pellinen
dmesg from 3.4.0-rc2-wl (167.99 KB, application/x-gzip)
2012-04-12 17:41 UTC, Juho-Mikko Pellinen
dmesg from 3.4.0-rc5-wl (334.46 KB, text/plain)
2012-05-05 11:56 UTC, Juho-Mikko Pellinen
Kernel config for 3.4.0-rc5-wl (86.46 KB, text/plain)
2012-05-05 11:59 UTC, Juho-Mikko Pellinen
Patch to fix race condition when firmware already cached (2.69 KB, patch)
2012-05-05 15:17 UTC, Larry Finger
Crash (681.43 KB, image/jpeg)
2012-05-19 21:19 UTC, Juho-Mikko Pellinen
Patch to assure that correct USB read buffer is used (2.29 KB, patch)
2012-06-19 00:56 UTC, Larry Finger

Description Juho-Mikko Pellinen 2012-04-04 19:40:27 UTC
The kernel oopsed as shown in the attached picture under the following circumstances:

While trying to find optimal AP settings, I changed the settings on the AP. Moments before, I had changed the network type from 802.11n to 802.11g. I then told NetworkManager to reconnect, and the oops happened. At that point I went to sleep.

In the morning I booted 3.4-rc1 three times, and every boot ended in a kernel oops. The oopses are shown in the other attached pictures.

After the third time I chose the previous version, which in my case was 3.3.0-gentoo, and with it I did not encounter any oopses. The performance of the network card is still lacking, but at least it does not oops anymore.

More about my struggles with rtl8192ce and additional details are available at https://forums.gentoo.org/viewtopic-t-918728.html .
Comment 1 Juho-Mikko Pellinen 2012-04-04 19:51:21 UTC
Created attachment 72818 [details]
Kernel oops message when connecting to 802.11g network.

I had just changed the network type from 802.11n to 802.11g and ordered NetworkManager to reconnect.
Comment 2 Juho-Mikko Pellinen 2012-04-04 19:54:18 UTC
Created attachment 72819 [details]
1st boot when the network card is probably attaching to wlan network.
Comment 3 Juho-Mikko Pellinen 2012-04-04 19:56:42 UTC
Created attachment 72820 [details]
2nd boot similar to the first one

Note that after the first oops in the evening I hadn't touched the settings in the AP.
Comment 4 Juho-Mikko Pellinen 2012-04-04 19:58:28 UTC
Created attachment 72821 [details]
3rd boot similar to the first and second
Comment 5 Larry Finger 2012-04-05 05:03:35 UTC
Created attachment 72824 [details]
Trial patch to fix oops

This patch should prevent the oops; however, it does not fix the underlying cause. I am hoping that making this change will help in gathering more information about what is happening. I am unable to reproduce this problem when connecting to either an 802.11n or 802.11g AP.

Does your system have CRDA installed?
Comment 6 Juho-Mikko Pellinen 2012-04-05 05:59:06 UTC
[I] net-wireless/crda
     Available versions:  1.0.1-r1 ~1.1.2-r3
     Installed versions:  1.0.1-r1(22.13.52 02.04.2012)

I'll apply the patch and test the patched kernel tonight.
Comment 7 Juho-Mikko Pellinen 2012-04-06 20:30:54 UTC
The patch fixed the oops. I tested again both with and without the patch, and without it I hit the same error again.

In the meantime I upgraded NetworkManager to version 0.9.4.0, but it didn't help.

Now that I can boot without a kernel oops, I have begun getting the attached messages in dmesg.
Comment 8 Juho-Mikko Pellinen 2012-04-06 20:32:17 UTC
Created attachment 72834 [details]
dmesg containing the stacktraces

This is from a recompiled kernel. I added some debug settings to the new kernel.
Comment 9 Juho-Mikko Pellinen 2012-04-06 20:34:42 UTC
Created attachment 72835 [details]
Kernel config.

This kernel config was used for compiling the patched kernel.
Comment 10 Juho-Mikko Pellinen 2012-04-06 20:57:29 UTC
Testing with speedtest.net gave me the following average results:
Ping 30ms, up/down 6Mbps/5.5Mbps.

That is an improvement over 3.3.0, where I usually got ~35-40ms and less than 3Mbps.

I tried moving the computer around so that the heating radiator behind it wouldn't interfere. Moving the case more than 0.5m away from its original location, which was quite close to the radiator, didn't improve the performance.

The radiator is a thin metal radiator containing circulating water. It is located ~15cm away from the antennas.

I tested the 802.11b/g/n connection to another AP (D-Link) but couldn't move any data (latency error on speedtest.net).

Testing against the original AP, but with an 802.11n network at 144Mb/s, the stacktraces vanished and the speed went down to ping 32ms, down 1.1Mbps, up 0.6Mbps.
Comment 11 Juho-Mikko Pellinen 2012-04-06 21:09:24 UTC
Created attachment 72836 [details]
dmesg containing some 802.11g stacktraces and then the 802.11n messages.

The module options were:
options rtl8192ce swenc=1 debug=3 ips=0

The AP configuration was 802.11n, 20MHz.
Comment 12 Juho-Mikko Pellinen 2012-04-06 21:14:47 UTC
Created attachment 72837 [details]
In this dmesg part I configured the AP to 802.11n, 40Mhz, Control Sideband: Upper

In this dmesg part I configured the AP to 802.11n, 40Mhz, Control Sideband: Upper, channel 13 which is least crowded.

speedtest.net gives me ping 31ms, down 2Mbps, up 3 Mbps.
Comment 13 Larry Finger 2012-04-06 21:20:37 UTC
Your dmesg output wrapped around and obliterated the part where the failures
first started. You need to capture the output earlier, or change the size of
the buffer. That is controlled by CONFIG_LOG_BUF_SHIFT in .config. You
currently have a value of 18 - please increase it to 20.
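
As a quick check of what that change buys, here is a minimal sketch. It assumes the usual relationship that the kernel log buffer is 2^CONFIG_LOG_BUF_SHIFT bytes; the helper name log_buf_bytes is made up for illustration and is not a kernel API.

```python
def log_buf_bytes(shift: int) -> int:
    """Size of the kernel log buffer in bytes for a given CONFIG_LOG_BUF_SHIFT.

    Assumption: the buffer is sized as a power of two, 1 << shift.
    """
    return 1 << shift


for shift in (18, 20):
    size = log_buf_bytes(shift)
    print(f"CONFIG_LOG_BUF_SHIFT={shift}: {size} bytes ({size // 1024} KiB)")
```

Under that assumption a value of 18 gives 256 KiB and a value of 20 gives 1024 KiB, so the larger setting holds four times as much output before it wraps.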

Did you notice that once the driver called CRDA, the WARN messages stopped? I
suspect that if you set CONFIG_CFG80211_INTERNAL_REGDB in .config, then the
messages will never occur.

Your speedtest.net rates are measuring the speed of your broadband connection. When I test on a local 802.11g network, I get ~18 Mbps and about 40 Mbps on an 802.11n link.
Comment 14 Juho-Mikko Pellinen 2012-04-07 06:48:15 UTC
My broadband connection is 100Mbps down and 10Mbps up and with wired connection I get actual speeds of around 85-90Mbps down and 10-11Mbps up.

The speed of broadband connection is not an issue here. And it is the reason why I'm not satisfied with 802.11g.
Comment 15 Larry Finger 2012-04-07 16:10:44 UTC
OK. With 802.11g, the most you will get with any driver is about 27 Mbps. With the kernel version of rtl8192cu on an 802.11n connection, I only get about 20 TX and RX.

If you want the best performance, then you should get driver RTL8192CU_8188CUS_8188CE-VAU_linux_v3.1.2590.20110922 from the Realtek web site. It does up to 68 RX and 45 TX.

My time has been spent making the in-kernel drivers be stable. When we get close to that condition, then I will be able to concentrate on the performance.
Comment 16 Juho-Mikko Pellinen 2012-04-07 17:05:58 UTC
Created attachment 72844 [details]
Connection to 802.11g with stacktraces

I recompiled the kernel with the following changes:
turned off CONFIG_CFG80211_WEXT
turned on CONFIG_EXPERT and CONFIG_CFG80211_INTERNAL_REGDB
increased CONFIG_LOG_BUF_SHIFT to 20

I rebooted with rtl8192ce blacklisted to prevent it from loading.

I then ran "modprobe -v rtl8192ce && /etc/init.d/NetworkManager restart" so that NM would detect and connect to the new network card. This caused my log files to overflow on disk: within a few seconds my 5 MB disk log was filled with stacktraces.

rmmodding rtl8192ce stopped the flow.

The second time I modprobed first without restarting NM and then took care to rmmod the driver as soon as I saw the first stacktraces.

Modprobing happens around 814.606881 and NM restart around ~888-890.

rmmodding happens just before the log-file ends.

At this point I don't care that much about the performance. The stability is much more important.
Comment 17 Juho-Mikko Pellinen 2012-04-07 17:16:42 UTC
Attaching to 802.11n network didn't produce any stacktraces.

I connected both wlan cards to the same AP, which is running in b/g/n mode, and looked through the settings with the iw tools. At that time I saw the following:

# iw dev wlan0 station dump
Station 1c:af:xx:xx:xx:xx (on wlan0)
	inactive time:	976 ms
	rx bytes:	717993
	rx packets:	4736
	tx bytes:	81747
	tx packets:	695
	tx retries:	61
	tx failed:	3
	signal:  	39 dBm
	signal avg:	-48 dBm
	tx bitrate:	48.0 MBit/s
	authorized:	yes
	authenticated:	yes
	preamble:	long
	WMM/WME:	no
	MFP:		no
	TDLS peer:		no
xxxx drivers # iw dev wlan1 station dump
Station 1c:af:xx:xx:xx:xx (on wlan1)
	inactive time:	8892 ms
	rx bytes:	25352
	rx packets:	204
	tx bytes:	2684
	tx packets:	26
	tx retries:	0
	tx failed:	0
	signal:  	-54 dBm
	signal avg:	-53 dBm
	tx bitrate:	300.0 MBit/s MCS 15 40Mhz short GI
	authorized:	yes
	authenticated:	yes
	preamble:	long
	WMM/WME:	yes
	MFP:		no
	TDLS peer:		no

Should the signal really be negative? On wlan0 the signal avg varies wildly between negative and positive, while wlan1 stays at -53 dBm all the time.
Comment 18 Larry Finger 2012-04-07 18:15:21 UTC
Yes, the signal should be negative, or a small positive number. The +39 dBm is clearly bogus.
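
To put numbers on that, here is a small sketch that converts dBm to milliwatts using the standard definition P(mW) = 10^(dBm/10). The function names and the 10 dBm cut-off for "small positive" are assumptions chosen for illustration, not values taken from any driver.

```python
def dbm_to_mw(dbm: float) -> float:
    """Convert a power level in dBm to milliwatts (0 dBm is 1 mW)."""
    return 10 ** (dbm / 10)


def looks_bogus(dbm: float, max_plausible_dbm: float = 10.0) -> bool:
    """Flag a received-signal reading above an assumed plausibility ceiling.

    The 10 dBm default is an arbitrary illustrative threshold.
    """
    return dbm > max_plausible_dbm


for reading in (39, -48, -54):
    print(f"{reading:>4} dBm = {dbm_to_mw(reading):.3e} mW, bogus: {looks_bogus(reading)}")
```

A reading of 39 dBm corresponds to roughly 7.9 W arriving at the antenna, which no ordinary AP could deliver, while -48 dBm is on the order of 10^-5 mW.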

The warning from ieee80211_get_tx_rate(), which is what you are seeing from rtl8192ce, is being discussed on the linux-wireless mailing list. I have submitted a patch to only issue it once. I will attach that one next.
Comment 19 Larry Finger 2012-04-07 18:18:13 UTC
Created attachment 72846 [details]
Patch to limit WARN statements from mac80211

This patch will prevent the warning in ieee80211_get_tx_rate() from spamming the logs by converting from WARN_ON to WARN_ON_ONCE.
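
For readers unfamiliar with the difference, the following is a user-space analogy in Python, not the kernel macro itself. It assumes only the general behaviour that a warn-once check still reports the condition to the caller every time but emits its message on the first hit only; the class name WarnOnce is invented for this sketch.

```python
class WarnOnce:
    """Analogy for a warn-once check: report every hit, print only the first."""

    def __init__(self) -> None:
        self._fired = False
        self.emitted = 0  # number of messages actually printed

    def __call__(self, condition: bool, message: str) -> bool:
        if condition and not self._fired:
            self._fired = True
            self.emitted += 1
            print("WARNING:", message)
        # The caller can still branch on the condition each time.
        return bool(condition)


warn_bad_rate = WarnOnce()
hits = sum(warn_bad_rate(True, "invalid tx rate index") for _ in range(1000))
print(f"condition hit {hits} times, message printed {warn_bad_rate.emitted} time(s)")
```

The underlying condition is still detected on every call, so the rate control problem is not hidden; only the repeated log output is suppressed.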
Comment 20 Juho-Mikko Pellinen 2012-04-07 20:38:52 UTC
Created attachment 72848 [details]
The number of stacktraces went down after the patch.

The patch improved the situation and I now get only a few stacktraces. Speedtest.net scored me 8Mbps/8Mbps at best with 802.11g, which is one of the best results so far.

The bogus dBm number was produced by the zd1211rw driver. I will have to investigate it later.

802.11n scores slightly lower performance, but does not show any regressions. No stacktraces there either.

I still get a number of CRDA messages:
"Apr 07 23:36:12 [kernel] [14654.742258] cfg80211: Calling CRDA for country: FI
Apr 07 23:36:12 [kernel] [14654.754289] cfg80211: Regulatory domain changed to country: FI ........."

I just installed netperf and I'm now investigating it. I'll set up a testing configuration later on.
Comment 21 Juho-Mikko Pellinen 2012-04-07 20:42:39 UTC
Interesting problem. I don't know if this worked before with 802.11g:
802.11n: 
# iw wlan1 station dump
Station cc:5d:xx:xx:xx:xx (on wlan1)
	inactive time:	1089 ms
	rx bytes:	9985583
	rx packets:	14523
	tx bytes:	10868372
	tx packets:	11401
	tx retries:	0
	tx failed:	0
	signal:  	-50 dBm
	signal avg:	-49 dBm
	tx bitrate:	300.0 MBit/s MCS 15 40Mhz short GI
	authorized:	yes
	authenticated:	yes
	preamble:	long
	WMM/WME:	yes
	MFP:		no
	TDLS peer:		no

802.11g:
# iw wlan1 station dump
failed to parse nested attributes!
Comment 22 Juho-Mikko Pellinen 2012-04-07 20:59:17 UTC
After a quick test of the rtl8192ce module options with 802.11n, it seems that "swenc=1" is necessary for successful association.
The current working options are:
options cfg80211 ieee80211_regdom=FI
options rtl8192ce swenc=1 debug=3
Comment 23 Larry Finger 2012-04-07 22:29:17 UTC
Strange - hardware encryption (default value of 0 for swenc) is OK here.

WARNING: at net/mac80211/tx.c:770 invoke_tx_handlers+0x85c/0x1520 [mac80211]() is due to a broken rate control setup.

WARNING: at net/mac80211/tx.c:55 invoke_tx_handlers+0x1424/0x1520 [mac80211]() is also caused by a bad rate control setup.

WARNING: at include/net/mac80211.h:1330 rtl_get_tcb_desc+0x3d8/0x5b0 [rtlwifi]() is the one that was just patched - also rate control.
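
When a log contains thousands of these, it can help to tally them by source location. The sketch below is one way to do that, assuming the lines have the same shape as the three quoted above (possibly with a syslog prefix in front); the regular expression and the function name tally_warnings are illustrative, and other kernel versions may format the line differently.

```python
import re
from collections import Counter

# Matches e.g. "WARNING: at net/mac80211/tx.c:770 invoke_tx_handlers+0x85c/0x1520 [mac80211]()"
WARN_RE = re.compile(
    r"WARNING: at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]"
)


def tally_warnings(lines):
    """Count WARNING lines per (file, line, function, module)."""
    counts = Counter()
    for text in lines:
        m = WARN_RE.search(text)
        if m:
            key = (m["file"], int(m["line"]), m["func"], m["module"])
            counts[key] += 1
    return counts


sample = [
    "WARNING: at net/mac80211/tx.c:770 invoke_tx_handlers+0x85c/0x1520 [mac80211]()",
    "WARNING: at net/mac80211/tx.c:55 invoke_tx_handlers+0x1424/0x1520 [mac80211]()",
    "WARNING: at include/net/mac80211.h:1330 rtl_get_tcb_desc+0x3d8/0x5b0 [rtlwifi]()",
    "WARNING: at net/mac80211/tx.c:770 invoke_tx_handlers+0x85c/0x1520 [mac80211]()",
]
for key, count in tally_warnings(sample).most_common():
    print(count, key)
```

Note that the module in brackets is where the warning function lives, so a warning raised inside mac80211 does not by itself say which driver triggered it; the call trace that follows each warning is needed for that.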

Are the first two coming from rtlwifi or zd1211rw? I cannot tell from what is posted.

What kernel are you using? I couldn't find it in any of the material.

If possible, could you clone the git tree from

git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-testing.git

That has all the latest patches. With it and rtl8192ce, I get up to 50 Mbps RX and 38 Mbps TX. If not then I will try to get you a patch to upgrade whatever kernel you are using to mine. Those performance numbers are higher than I posted earlier because I thought you had a USB device.
Comment 24 Juho-Mikko Pellinen 2012-04-11 19:32:42 UTC
I have been using 3.4-rc1 with the null-pointer fix and the WARN_ON_ONCE fix. Network mode is 802.11n, 40MHz, control sideband Upper and channel 13. Module options used were swenc=1 debug=3.
I unplugged the zd1211rw card (wlan0) after the rtl8192ce began working adequately, so the following logs contain only lines related to wlan1/rtl8192ce.

My computer has now been up for four (4) days and has lost connectivity twice. Both times I rmmodded and modprobed the rtl8192ce, after which I regained connectivity. The speed of the connection has varied somewhat, from really sluggish to ~7Mbps up/down.
Occasionally NetworkManager reconnects and asks for the wlan password even though it remembers it. I have the logs stored, but I don't have time right now to go through and process them.

I'm now going through the process of cloning and building the wireless-testing after which I'll reboot and remove all module options.

Does the wireless-testing git tree contain the WARN_ON_ONCE fix and/or the null-pointer fix?

I also have the 3.4-rc2 ready for testing, but haven't yet booted it.
Comment 25 Larry Finger 2012-04-12 03:21:46 UTC
Wireless-testing has both patches included. If 3.4-rc2-wl shows the warning, please post the entire dmesg output. I still do not know why the rate control routines are being set up wrong.
Comment 26 Juho-Mikko Pellinen 2012-04-12 17:41:50 UTC
Created attachment 72901 [details]
dmesg from 3.4.0-rc2-wl

Kernel: 3.4.0-rc2-wl
Module options:
options cfg80211 ieee80211_regdom=FI
rtl8192ce: defaults (none set)

First connected to 802.11n, 40Mhz, Control Sideband Upper, Channel 13, WPA2-PSK AES
Apr 12 20:30:47 [kernel] [42875.488085] changed the same AP to 802.11g, Channel 13.
The stacktraces appear only on 802.11g network.

Speeds: (speedtest, still no netperf)
802.11n 5Mbps down, 10Mbps up
802.11g 10Mbps down, 10Mbps up
Comment 27 Juho-Mikko Pellinen 2012-05-05 11:56:59 UTC
Created attachment 73192 [details]
dmesg from 3.4.0-rc5-wl

# iw wlan1 station dump
Station cc:5d:4e:7a:84:e7 (on wlan1)
	inactive time:	34283 ms
	rx bytes:	93710
	rx packets:	553
	tx bytes:	24396
	tx packets:	129
	tx retries:	0
	tx failed:	0
	signal:  	-50 dBm
	signal avg:	-50 dBm
	tx bitrate:	300.0 MBit/s MCS 15 40Mhz short GI
	authorized:	yes
	authenticated:	yes
	preamble:	long
	WMM/WME:	yes
	MFP:		no
	TDLS peer:		no

wireless-testing was up-to-date and in use at the time of making this comment.
Comment 28 Juho-Mikko Pellinen 2012-05-05 11:59:42 UTC
Created attachment 73193 [details]
Kernel config for 3.4.0-rc5-wl
Comment 29 Larry Finger 2012-05-05 15:17:46 UTC
Created attachment 73198 [details]
Patch to fix race condition when firmware already cached

This past week, a race condition was found where ieee80211 operations were started before the initialization was finished when the firmware file had previously been read by user space, and was available in cache.
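
The general shape of this kind of race can be sketched in a few lines of user-space Python. This is not the patch and does not reflect the rtlwifi code; the Device class, its rate_table field, and the method names are all hypothetical. It only illustrates why an operation that starts before initialization has finished fails, and how waiting on a completion signal avoids that.

```python
import threading


class Device:
    """Hypothetical device whose state is only valid after finish_init()."""

    def __init__(self) -> None:
        self.rate_table = None               # filled in late during init
        self.init_done = threading.Event()   # signals that init has completed

    def finish_init(self) -> None:
        self.rate_table = [1.0, 2.0, 5.5, 11.0]
        self.init_done.set()

    def start_unsafe(self) -> float:
        # Racy: if the caller gets here early, rate_table is still None.
        return self.rate_table[0]

    def start_safe(self, timeout: float = 5.0) -> float:
        # Block until initialization has completed before touching state.
        if not self.init_done.wait(timeout):
            raise TimeoutError("initialization did not finish")
        return self.rate_table[0]


def demo():
    dev = Device()
    # A fast path (for example, data already cached) lets the operation
    # run before init has finished.
    try:
        dev.start_unsafe()
        unsafe_outcome = "ok"
    except TypeError:
        unsafe_outcome = "crashed"

    results = []
    worker = threading.Thread(target=lambda: results.append(dev.start_safe()))
    worker.start()          # starts early, but waits on init_done
    dev.finish_init()
    worker.join()
    return unsafe_outcome, results


print(demo())
```

In the slow path the operation happens to arrive after initialization and nothing goes wrong, which is why a bug of this kind can stay hidden until the timing changes.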

I think this will fix your problem. The patch has been submitted, but not yet added to wireless-testing. As soon as it is, it should be pushed to 3.4 mainline, and backported to stable.
Comment 30 Juho-Mikko Pellinen 2012-05-06 17:46:00 UTC
After adding your patch, things have now quieted down for at least 4 hours. I will not comment here again if the problems do not reappear. I'll try to do some extended testing during workdays.

The datarate has risen and is now 10Mbps/10Mbps for 802.11n (40Mhz). It is also an improvement.

Thanks for the help!
Comment 31 Juho-Mikko Pellinen 2012-05-13 18:37:16 UTC
After using the patched kernel for a week I haven't encountered any problems.
Has the patch been included in the wireless-testing or mainline (rcX) kernels?
Comment 32 Larry Finger 2012-05-13 22:19:02 UTC
The patch has been sent to wireless-testing with the recommendation that it be included in 3.4; however, it has not yet been merged with that tree, nor the mainline one.

The wireless maintainer reported that his day job kept him busy last week and that he was behind.
Comment 33 Juho-Mikko Pellinen 2012-05-19 21:19:45 UTC
Created attachment 73336 [details]
Crash

The patched kernel ran quite well until I installed NetworkManager once again. Before that, I used wpa_supplicant and dhclient manually.
With NetworkManager the continuous "Reason 7" deassociations appeared again, and after a while I got this crash. The kernel included the race-condition patch mentioned earlier. Before this event I had pulled and compiled 3.4.0-rc7-wl, which I am now running and with which I have not yet been able to reproduce this oops.
Comment 34 bminaker 2012-06-18 19:16:18 UTC
I can report that using any of the stock Arch Linux kernels newer than 3.4-1, or Fedora 17 with kernel 3.4.2, breaks the rtl8192cu driver. It times out trying to connect (using NetworkManager in KDE). Reverting to 3.3.8-1 solves the problem.
Comment 35 Larry Finger 2012-06-19 00:56:48 UTC
Created attachment 73831 [details]
Patch to assure that correct USB read buffer is used

This patch ensures that there is not a race condition when selecting and using the USB read buffer. I have had the patch for some time, but no real testers. If you can, please test and report.
