Bug 108461 - rtl8821ae causing 100ms scheduling latency
Summary: rtl8821ae causing 100ms scheduling latency
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-11-25 20:20 UTC by James Ettle
Modified: 2015-11-29 23:49 UTC (History)
2 users (show)

See Also:
Kernel Version: 4.2.6
Tree: Fedora
Regression: No


Attachments
config file (156.48 KB, text/plain)
2015-11-28 15:35 UTC, James Ettle
Details

Description James Ettle 2015-11-25 20:20:09 UTC
My desktop has an integrated 1RTL8821AE 802.11ac PCIe Wireless Network Adapter, ID 0ec:8821, using the rtl8821ae driver. This occasionally causes a ~100ms latency, as measured using perf. I first noticed this as audio stuttering when Pulseaudio wasn't scheduled in time. I believe rtl8821ae is to blame, since I see no latency problems when the wifi adapter is switched off.

To track down the latency, I:

1. Ran perf sched record for a while.

2. Checked perf sched latency to spot when the latency occurred.

3. Looked in perf sched script's output to see what was going on at the time.

Here's a 100ms scheduling delay in watchdog/2 at 170597.339822:


  wpa_supplicant  1168 [002] 170597.339814: sched:sched_wakeup: kworker/2:2:22637 [120] success=1 CPU:002
  wpa_supplicant  1168 [002] 170597.339820: sched:sched_stat_runtime: comm=wpa_supplicant pid=1168 runtime=692443 [ns] vruntime=4718
2887560 [ns]
  wpa_supplicant  1168 [002] 170597.339822: sched:sched_switch: wpa_supplicant:1168 [120] R ==> watchdog/2:20 [0]


Another in kworker/1 at 171226.630376:


         swapper     0 [003] 171226.629871: sched:sched_wakeup: skype:3592 [120] success=1 CPU:003
         swapper     0 [003] 171226.629874: sched:sched_switch: swapper/3:0 [120] R ==> skype:3592 [120]
  wpa_supplicant  1168 [001] 171226.630372: sched:sched_switch: wpa_supplicant:1168 [120] R ==> migration/1:14 [0]
     migration/1    14 [001] 171226.630376: sched:sched_switch: migration/1:14 [0] S ==> kworker/1:0:24376 [120]


And one more in  at 171604.146285 in watchdog/1:


  wpa_supplicant  1168 [001] 171604.146276: sched:sched_wakeup: kworker/1:0:24376 [120] success=1 CPU:001
  wpa_supplicant  1168 [001] 171604.146283: sched:sched_stat_runtime: comm=wpa_supplicant pid=1168 runtime=50022 [ns] vruntime=14207
4593534 [ns]
  wpa_supplicant  1168 [001] 171604.146284: sched:sched_switch: wpa_supplicant:1168 [120] R ==> watchdog/1:13 [0]
      watchdog/1    13 [001] 171604.146287: sched:sched_switch: watchdog/1:13 [0] S ==> kworker/1:0:24376 [120]


They all seem to follow wpa_supplicant.
Comment 1 James Ettle 2015-11-25 22:35:49 UTC
Note - I mispasted the dev id, it's 10ec:8821.
Comment 2 Larry Finger 2015-11-26 03:14:03 UTC
There are two things to try. Kernel 4.2.6 included a patch that reverted an earlier patch that disabled a clearing of RTL8821AE interrupts before they are set again, which led to kernel oops on boot.

First, I would disable MSI interrupts with the following (as root):

echo "msi=0" > /etc/modprobe.d/50-rtl8821ae.conf
modprobe -rv rtl8821ae
modprobe -v rtl8821ae

Then run tests to see if you still have audio stuttering.

If that test still fails, then do the following (again as root):

echo "int_clear=0" > /etc/modprobe.d/50-rtl8821ae.conf
modprobe -rv rtl8821ae
modprobe -v rtl8821ae

Now retest. Report both sets of results.

If neither changes help, please post the exact set of perf commands you used. I have not used that utility before, and I'm having some difficulty in repeating your tests.
Comment 3 James Ettle 2015-11-26 19:27:04 UTC
Shouldn't that be:

  echo "options rtl8821ae msi=0" > /etc/modprobe.d/50-rtl8821ae.conf

Testing now...

I should point out that this has been happening since at least kernel 4.1.6, the version around when I got the machine.

I'm not much of a perf expert either, just read enough of the man pages to try and figure out how to spot scheduling latencies. (Starting point here was the PA log reporting a scheduling delay.)

The commands are as follows. First I run:

  # perf sched record

for a few minutes with audio running, or until I hear a glitch. Press Ctrl+C to stop. Then I run

  # perf sched latency

This gives a top-like list of the scheduling delays for various processes, their maximum delay and when that occurred. I pick out the worst (~100ms with an unconnected interface, 20-50ms when connected) and note when it happened.

Finally, I run

  # perf sched script

and search for the time I noted above. (Searching is like 'less' -- hit / key and enter a pattern.) This, as far as I understand, is a trace of what the scheduler did. I inspect what's going on just before the delay.

What I generally find:

Baseline with interface OFF:
Maximum latencies of about 2ms.

When the interface is up BUT NOT CONNECTED:
I see latencies of ~100ms. This is enough to audibly glitch PA. In 'perf sched script', I see that wpa_supplicant is always scheduled just beforehand.

When the interface is CONNECTED:
I see latencies of around 20-50ms, so higher than the baseline. PA seems to be able to cope with this, so I don't hear a glitch. Also the scheduling pattern in the 'perf sched script' is different - I don't see wpa_supplicant being scheduled just beforehand. But I do see kworker/u8:3 a lot before, not sure if this is doing something for rtl8821ae.

This happens with either no module options or msi=0. I haven't tested int_clear=0 yet.
Comment 4 James Ettle 2015-11-26 20:42:42 UTC
With *just* 'int_clear=0', I get the same sort of results as with no options or just 'msi=0'.
Comment 5 Larry Finger 2015-11-27 22:16:47 UTC
OK, interrupts are not the problem. Yes, you are correct that the driver name is needed in the options line.

For some reason, I will not be able to duplicate your tests. When I use

perf sched record

my system responds with "invalid or unsupported event: 'sched:sched_switch'". I'm still trying to solve that problem.

When the wifi driver is not connected, the only usage of wpa_supplicant should be to trigger scans. What method are you using to control the network?

Does 'grep scan /etc/wpa_supplicant.conf' yield any output?
Comment 6 James Ettle 2015-11-27 22:28:49 UTC
(In reply to Larry Finger from comment #5)
> OK, interrupts are not the problem. Yes, you are correct that the driver
> name is needed in the options line.
> 
> For some reason, I will not be able to duplicate your tests. When I use
> 
> perf sched record
> 
> my system responds with "invalid or unsupported event:
> 'sched:sched_switch'". I'm still trying to solve that problem.

No idea what this means. If it helps, I'm using the stock Fedora kernel which has the following perf config options:

CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_CGROUP_PERF=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y

> When the wifi driver is not connected, the only usage of wpa_supplicant
> should be to trigger scans. What method are you using to control the network?

NetworkManager-1.0.6-8.fc23.x86_64 with standard Gnome 3.18.2 UI utils. This is a pretty standard Fedora 23 install.

> Does 'grep scan /etc/wpa_supplicant.conf' yield any output?

Nothing at all.
Comment 7 James Ettle 2015-11-27 22:34:35 UTC
I should also point out: I've another two F23 machines with the same versions of everything, one using ath9k and the other iwl4965. Neither have any odd latency issues, so I'm not convinced it's anything else in the wireless stack.
Comment 8 Larry Finger 2015-11-27 22:53:01 UTC
My configuration has

CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_CGROUP_PERF=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_USE_VMALLOC=y
CONFIG_PERF_EVENTS=y
CONFIG_DEBUG_PERF_USE_VMALLOC=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y

All of the PERF options are selected.
Comment 9 James Ettle 2015-11-28 15:35:47 UTC
Created attachment 195661 [details]
config file

I've attached the config file for the kernel I'm currently using, but I don't know if that'll help -- I'm otherwise completely inexpert in perf and tracking down latencies.

The machine in question has an i5-5200U processor, so 2 cores... I guess if your machine has more cores on which to schedule stuff it's less likely you'd see (or hear!) the delay.
Comment 10 Larry Finger 2015-11-29 22:10:03 UTC
My CPU is an i7-4600M CPU @ 2.90GHz, thus I also have 2 cores plus hyperthreading.

I did install Fedora 23 and tested it. One thing I do not understand is how you keep an interface unconnected. I know how to do it with the KDE applet, but not with Gnome. I went back to openSUSE 13.2 where I finally managed to get a version of perf that works. The problem was the package included with openSUSE could not handle the sched option; however, when I built and installed the version included with the kernel source, then it works.

The longest scheduling delays happen when scanning. It doesn't seem to matter whether the interface is connected or not. I trigger the scans with a 'sudo iwlist <iface> scan' command. The longest delays are usually 40-50 ms and happen in the scheduling of kthreadd or one of the kernel workers. Obviously, I need to investigate what happens when scanning.
Comment 11 James Ettle 2015-11-29 23:49:13 UTC
(In reply to Larry Finger from comment #10)
 
> I did install Fedora 23 and tested it. One thing I do not understand is how
> you keep an interface unconnected. I know how to do it with the KDE applet,

Thanks for going to all that effort!

To prevent connection, assuming you've selected and set up a network: Go into WiFi Settings. There's a '*' button next to the network. Click on that and select the 'Identity' page. Uncheck 'Connect automatically' and click 'Apply'.

Note You need to log in before you can comment on or make changes to this bug.