Bug 200811 - Very high scheduling latency on Ryzen systems
Summary: Very high scheduling latency on Ryzen systems
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-14 03:36 UTC by Niklas Haas
Modified: 2018-10-24 06:29 UTC
CC List: 2 users

See Also:
Kernel Version: 4.17.8-gentoo-pcireset
Subsystem:
Regression: No
Bisected commit-id:


Attachments
/proc/config.gz - Machine A (119.00 KB, text/plain) - 2018-08-14 03:36 UTC, Niklas Haas
/proc/cpuinfo - Machine A (44.73 KB, text/plain) - 2018-08-14 04:17 UTC, Niklas Haas
/proc/meminfo - Machine A (1.32 KB, text/plain) - 2018-08-14 04:17 UTC, Niklas Haas
/proc/iomem - Machine A (3.90 KB, text/plain) - 2018-08-14 04:17 UTC, Niklas Haas
/proc/ioports - Machine A (1.54 KB, text/plain) - 2018-08-14 04:18 UTC, Niklas Haas
/proc/zoneinfo - Machine A (19.77 KB, text/plain) - 2018-08-14 04:18 UTC, Niklas Haas
/proc/buddyinfo - Machine A (404 bytes, text/plain) - 2018-08-14 04:18 UTC, Niklas Haas
lspci -vvv (92.84 KB, text/plain) - 2018-08-14 04:19 UTC, Niklas Haas
/proc/vmstat - Machine A (3.00 KB, text/plain) - 2018-08-14 04:22 UTC, Niklas Haas
/proc/softirqs - Machine A (3.95 KB, text/plain) - 2018-08-14 04:23 UTC, Niklas Haas
dmidecode - Machine A (22.53 KB, text/plain) - 2018-08-14 04:24 UTC, Niklas Haas
`perf record` sample taken during mpv free() loop (394.42 KB, application/x-xz) - 2018-08-19 03:11 UTC, Niklas Haas

Description Niklas Haas 2018-08-14 03:36:04 UTC
Created attachment 277837 [details]
/proc/config.gz - Machine A

As an early disclaimer: this may be a hardware issue, a kernel issue, or both.
But lacking a way to contact AMD engineers directly, and in the hope that the
kernel might be able to do *something* to improve the situation (as well as to
gather information and give the problem some exposure), I'm filing it here.

Bug description
===============

On Ryzen systems, scheduling latency seems abnormally high when the system is
under memory pressure, with the Threadripper line of CPUs being particularly
affected. By "high" I mean values of >10ms and even hitting 100ms-1000ms under
very heavy load.

Reproduction steps
==================

1. Run `cyclictest --numa -p90 -m`, or any other low-latency application (e.g.
   JACK, PulseAudio, ...). cyclictest is part of the `rt-tests` suite hosted on
   git.kernel.org.

2. Put the system under heavy memory load. The best way I have found to
   reproduce the issue is using `stressapptest` (https://github.com/stressapptest/stressapptest),
   e.g. with `stressapptest -M 256 -m 32 -s 60`. An alternative is `stress`,
   e.g. `stress --vm 32 --vm-bytes 256M`, or even the `hackbench` included in
   `rt-tests`. Alternatively, compile a large C project. (This is how I
   originally discovered the issue).

   Note that in the tests below I have used different load generators on
   different machines, but the results are broadly comparable regardless of the
   method; the only difference I noticed is how soon and how frequently the
   latency spikes occur. (A combined reproduction sketch is given after this
   list.)

3. Observe the 'maximum latency' column of `cyclictest`'s output.
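
For convenience, the whole reproduction boils down to running the latency
measurement and the memory load side by side, roughly like this (a sketch only;
thread counts and duration should be adjusted to the machine under test):

$ cyclictest --numa -p90 -m          # terminal 1: latency measurement
$ stressapptest -M 256 -m 32 -s 60   # terminal 2: memory load for 60 seconds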

Actual result
=============

The observed latencies are extremely high, even on a kernel built with
CONFIG_PREEMPT=y. Latencies go well into the 100ms range, and can even hit
1000ms. For example, an output may look like this (after running
`stressapptest` for as little as 10 seconds):

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 9.28 6.91 4.36 1/1483 5765

T: 0 (28043) P:98 I:1000 C: 149780 Min:      2 Act:    3 Avg:   26 Max:   68807
T: 1 (28044) P:98 I:1500 C:  99853 Min:      2 Act:    4 Avg:   38 Max:   68926
T: 2 (28045) P:98 I:2000 C:  74890 Min:      2 Act:    6 Avg:   49 Max:   68245
T: 3 (28046) P:98 I:2500 C:  59887 Min:      2 Act:    7 Avg:   40 Max:   62970
T: 4 (28047) P:98 I:3000 C:  49874 Min:      2 Act:    8 Avg:   39 Max:  105782
T: 5 (28048) P:98 I:3500 C:  42776 Min:      2 Act:    5 Avg:   37 Max:   68180
T: 6 (28049) P:98 I:4000 C:  37428 Min:      2 Act:    4 Avg:   33 Max:   75606
T: 7 (28050) P:98 I:4500 C:  33299 Min:      2 Act:    4 Avg:   30 Max:    1875
T: 8 (28051) P:98 I:5000 C:  29967 Min:      2 Act:    6 Avg:   47 Max:    7764
T: 9 (28052) P:98 I:5500 C:  27245 Min:      2 Act:    7 Avg:   44 Max:    5314
T:10 (28053) P:98 I:6000 C:  24974 Min:      3 Act:    4 Avg:   49 Max:     740
T:11 (28054) P:98 I:6500 C:  23053 Min:      2 Act:    7 Avg:   53 Max:     434
T:12 (28055) P:98 I:7000 C:  21406 Min:      3 Act:    4 Avg:   53 Max:    2973
T:13 (28056) P:98 I:7500 C:  19979 Min:      3 Act:    3 Avg:   56 Max:    4425
T:14 (28057) P:98 I:8000 C:  18730 Min:      2 Act:    8 Avg:   57 Max:     596
T:15 (28058) P:98 I:8500 C:  17628 Min:      2 Act:    6 Avg:   70 Max:    5369
T:16 (28059) P:98 I:9000 C:  16642 Min:      2 Act:    6 Avg:   44 Max:   63759
T:17 (28060) P:98 I:9500 C:  15767 Min:      2 Act:    7 Avg:   43 Max:   62685
T:18 (28061) P:98 I:10000 C:  14978 Min:      2 Act:    6 Avg:   31 Max:   64241
T:19 (28062) P:98 I:10500 C:  14265 Min:      3 Act:    4 Avg:   43 Max:   55385
T:20 (28063) P:98 I:11000 C:  13610 Min:      2 Act:    4 Avg:   41 Max:   58667
T:21 (28064) P:98 I:11500 C:  13026 Min:      2 Act:    6 Avg:   35 Max:   38604
T:22 (28065) P:98 I:12000 C:  12482 Min:      2 Act:    3 Avg:   24 Max:   65180
T:23 (28066) P:98 I:12500 C:  11987 Min:      3 Act:    4 Avg:   21 Max:     301
T:24 (28067) P:98 I:13000 C:  11526 Min:      3 Act:    7 Avg:   43 Max:    1734
T:25 (28068) P:98 I:13500 C:  11099 Min:      3 Act:    4 Avg:   61 Max:    3004
T:26 (28069) P:98 I:14000 C:  10703 Min:      3 Act:    4 Avg:   58 Max:     647
T:27 (28070) P:98 I:14500 C:  10334 Min:      2 Act:    5 Avg:   48 Max:     629
T:28 (28071) P:98 I:15000 C:   9989 Min:      3 Act:    5 Avg:   54 Max:    6032
T:29 (28072) P:98 I:15500 C:   9667 Min:      2 Act:    4 Avg:   51 Max:    3003
T:30 (28073) P:98 I:16000 C:   9365 Min:      3 Act:    7 Avg:   55 Max:     354
T:31 (28074) P:98 I:16500 C:   9081 Min:      3 Act:    7 Avg:   87 Max:    6409

This was run on my personal machine ("Machine A"), built around a 1950X.

On a non-NUMA Ryzen system, the numbers are significantly lower, but still
higher than they should be. For example, a typical Ryzen 2700X desktop system
("Machine D") looks something like this:

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 13.88 6.81 4.45 20/1356 7555

T: 0 ( 7394) P:90 I:1000 C:  62225 Min:      1 Act:  144 Avg:  192 Max:   18190
T: 1 ( 7395) P:90 I:1500 C:  42300 Min:      1 Act:   37 Avg:  253 Max:   17808
T: 2 ( 7396) P:90 I:2000 C:  31767 Min:      1 Act:   15 Avg:  342 Max:   17449
T: 3 ( 7397) P:90 I:2500 C:  25674 Min:      1 Act:   20 Avg:  406 Max:   17945
T: 4 ( 7398) P:90 I:3000 C:  21678 Min:      1 Act:   26 Avg:  452 Max:   14535
T: 5 ( 7399) P:90 I:3500 C:  18741 Min:      1 Act:   15 Avg:  507 Max:   13813
T: 6 ( 7400) P:90 I:4000 C:  16464 Min:      1 Act:   20 Avg:  580 Max:   15629
T: 7 ( 7401) P:90 I:4500 C:  14885 Min:      1 Act: 3364 Avg:  582 Max:   17282
T: 8 ( 7402) P:90 I:5000 C:  13492 Min:      1 Act: 5305 Avg:  630 Max:   15512
T: 9 ( 7403) P:90 I:5500 C:  12363 Min:      1 Act:   16 Avg:  665 Max:   16643
T:10 ( 7404) P:90 I:6000 C:  11478 Min:      1 Act:   24 Avg:  658 Max:   13628
T:11 ( 7405) P:90 I:6500 C:  10669 Min:      1 Act:   37 Avg:  721 Max:   14851
T:12 ( 7406) P:90 I:7000 C:   9992 Min:      2 Act:   27 Avg:  704 Max:   12343
T:13 ( 7407) P:90 I:7500 C:   9408 Min:      2 Act:   29 Avg:  732 Max:   13683
T:14 ( 7408) P:90 I:8000 C:   8853 Min:      2 Act:   22 Avg:  671 Max:   13566
T:15 ( 7409) P:90 I:8500 C:   8400 Min:      2 Act:   23 Avg:  715 Max:   13756

Expected result
===============

On various Intel systems I have tested, the latencies typically stay well
below 10ms on a kernel built with CONFIG_PREEMPT_NONE=y. An example result,
from an Intel desktop i3 system ("Machine B") running Ubuntu 16.04 with a
stock `-generic` kernel:

# ./cyclictest --smp -p90 -m
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 4.26 3.11 1.44 1/619 14414

T: 0 (14118) P:90 I:1000 C: 305345 Min:      1 Act:    1 Avg:   77 Max:    3212
T: 1 (14119) P:90 I:1500 C: 209376 Min:      1 Act:    2 Avg:   83 Max:    4062
T: 2 (14120) P:90 I:2000 C: 158013 Min:      1 Act:    1 Avg:   87 Max:    5174
T: 3 (14121) P:90 I:2500 C: 126484 Min:      1 Act:    2 Avg:   86 Max:    2814

The results are (obviously) much better on a kernel built with
CONFIG_PREEMPT=y, such as these results from the same system using the Ubuntu
`-lowlatency` kernel:

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 4.61 3.09 1.38 1/582 4642

T: 0 ( 4360) P:90 I:1000 C: 280040 Min:      1 Act:    2 Avg:    8 Max:     252
T: 1 ( 4361) P:90 I:1500 C: 186655 Min:      1 Act:    1 Avg:    9 Max:     708
T: 2 ( 4362) P:90 I:2000 C: 139990 Min:      1 Act:    2 Avg:   11 Max:     264
T: 3 ( 4363) P:90 I:2500 C: 111992 Min:      1 Act:    1 Avg:   10 Max:     260

These are latencies I'd consider "reasonable" for a budget desktop system
under extreme load. These numbers were obtained by using `stress --vm 4
--vm-bytes 256M` at the same time as `cyclictest --smp -p90 -m`.

For comparison with a typical NUMA system, I have measured a relatively old
dual Xeon E5410 setup ("Machine C"), using a non-preemptible kernel (also
running Ubuntu 16.04):

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 20.20 29.30 31.89 1/1704 10951

T: 0 ( 6624) P:90 I:1000 C:  76576 Min:      4 Act:    8 Avg:   12 Max:    1572
T: 1 ( 6625) P:90 I:1500 C:  51052 Min:      4 Act:    6 Avg:   14 Max:    1241
T: 2 ( 6626) P:90 I:2000 C:  38289 Min:      5 Act:    8 Avg:   14 Max:    1734
T: 3 ( 6628) P:90 I:2500 C:  30631 Min:      5 Act:    8 Avg:   17 Max:    1260
T: 4 ( 6629) P:90 I:3000 C:  25526 Min:      5 Act:    7 Avg:   19 Max:    2182
T: 5 ( 6631) P:90 I:3500 C:  21879 Min:      4 Act:    8 Avg:   23 Max:     685
T: 6 ( 6633) P:90 I:4000 C:  19144 Min:      5 Act:    7 Avg:   22 Max:    3776
T: 7 ( 6634) P:90 I:4500 C:  17017 Min:      5 Act:    8 Avg:   22 Max:     938

These numbers are high, but still within the range of what I'd expect for a
non-preemptible kernel. This result was obtained using `stress --vm 8
--vm-bytes 256M`.

Notes on reproduction / causes
==============================

There are several factors determining both the magnitude and presence of
latency spikes. The first factor is that the machine has to be under heavy
memory load in general. Loading a single CPU core is not enough, and does not
appear to correlate with a measurable increase in latencies even on the loaded
core. On a more-or-less idle system, my numbers are fine (latencies in the
1-20μs range, as I'd expect for a PREEMPT kernel). Purely computational work
on the CPUs (e.g. working sets that fit in L2) does not trigger the issue, even
when the system is fully loaded this way. I/O load does not trigger the issue, and
the issue still occurs even with no block devices mounted.
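
To illustrate the distinction with the same `stress` tool (a sketch; worker
counts assume the 32-thread machine):

$ stress --cpu 32                  # pure computation: latencies stay low
$ stress --vm 32 --vm-bytes 256M   # memory churn: latencies spike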

The second factor is that the latencies only increase on the loaded cores. For
example, if I use CPU affinities to exclude a single core (#7 in this case)
from running anything other than `cyclictest`, I get a result that looks like
this after around 5 minutes of `hackbench` (a sketch of the affinity split is
given after the output):

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 185.30 109.66 47.59 164/1953 19143

T: 0 ( 2678) P:98 I:1000 C: 260477 Min:      2 Act:    6 Avg:    7 Max:  244128
T: 1 ( 2679) P:98 I:1500 C: 173854 Min:      2 Act:    4 Avg:    7 Max:   91214
T: 2 ( 2680) P:98 I:2000 C: 130348 Min:      2 Act:    5 Avg:    9 Max:   80307
T: 3 ( 2681) P:98 I:2500 C: 104261 Min:      2 Act:    6 Avg:   10 Max:   82710
T: 4 ( 2682) P:98 I:3000 C:  86907 Min:      2 Act:    6 Avg:   10 Max:   68245
T: 5 ( 2683) P:98 I:3500 C:  74459 Min:      2 Act:    6 Avg:   13 Max:  129451
T: 6 ( 2684) P:98 I:4000 C:  65092 Min:      3 Act:    7 Avg:   18 Max:  133967
T: 7 ( 2685) P:98 I:4500 C:  58032 Min:      2 Act:    6 Avg:    7 Max:     372
T: 8 ( 2686) P:98 I:5000 C:  52165 Min:      2 Act:    8 Avg:   12 Max:   68300
T: 9 ( 2687) P:98 I:5500 C:  47470 Min:      2 Act:    6 Avg:    7 Max:   25042
T:10 ( 2688) P:98 I:6000 C:  43479 Min:      3 Act:    5 Avg:   12 Max:   90672
T:11 ( 2689) P:98 I:6500 C:  40166 Min:      3 Act:    4 Avg:    7 Max:   30386
T:12 ( 2690) P:98 I:7000 C:  37293 Min:      3 Act:    8 Avg:    8 Max:   50129
T:13 ( 2691) P:98 I:7500 C:  34799 Min:      3 Act:    6 Avg:   11 Max:   84115
T:14 ( 2692) P:98 I:8000 C:  32605 Min:      3 Act:    5 Avg:   16 Max:  1399177
T:15 ( 2693) P:98 I:8500 C:  30720 Min:      3 Act:    6 Avg:    8 Max:   22138
T:16 ( 2694) P:98 I:9000 C:  28998 Min:      2 Act:    8 Avg:   12 Max:   50514
T:17 ( 2695) P:98 I:9500 C:  27487 Min:      3 Act:    6 Avg:    7 Max:   12135
T:18 ( 2696) P:98 I:10000 C:  26100 Min:      3 Act:    5 Avg:   12 Max:  1194399
T:19 ( 2697) P:98 I:10500 C:  24863 Min:      3 Act:    5 Avg:    9 Max:   81541
T:20 ( 2698) P:98 I:11000 C:  23737 Min:      3 Act:    5 Avg:    8 Max:   41699
T:21 ( 2699) P:98 I:11500 C:  22706 Min:      3 Act:    5 Avg:    6 Max:   17104
T:22 ( 2700) P:98 I:12000 C:  21758 Min:      3 Act:    4 Avg:    8 Max:   44769
T:23 ( 2701) P:98 I:12500 C:  20891 Min:      2 Act:    8 Avg:    7 Max:     144
T:24 ( 2702) P:98 I:13000 C:  20087 Min:      3 Act:    4 Avg:    5 Max:      42
T:25 ( 2703) P:98 I:13500 C:  19343 Min:      3 Act:    5 Avg:    6 Max:      51
T:26 ( 2704) P:98 I:14000 C:  18652 Min:      3 Act:    5 Avg:    5 Max:     144
T:27 ( 2705) P:98 I:14500 C:  18009 Min:      3 Act:    6 Avg:    5 Max:      34
T:28 ( 2706) P:98 I:15000 C:  17409 Min:      3 Act:    5 Avg:    6 Max:      39
T:29 ( 2707) P:98 I:15500 C:  16847 Min:      3 Act:    9 Avg:    5 Max:      54
T:30 ( 2708) P:98 I:16000 C:  16321 Min:      3 Act:    6 Avg:    6 Max:      36
T:31 ( 2709) P:98 I:16500 C:  15826 Min:      3 Act:    5 Avg:    8 Max:     866
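
The affinity split used for the run above can be reproduced roughly like this
(a sketch; the mask simply excludes CPU 7, matching the description above):

$ cyclictest --numa -p98 -m               # measurement runs on all cores
$ taskset --cpu-list 0-6,8-31 hackbench   # load is kept off core 7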

The extreme results on the Threadripper systems also seem linked to the
system's NUMA architecture. In particular, if I constrain my load to a single
NUMA node (e.g. #0), I get lower results, more in line with what you see from
the desktop Ryzen CPUs. However, loading the nodes independently (e.g. with two
instances of `stressapptest`, each constrained to a single node) also did not
appear to trigger the insanely high latencies; so I suspect it has something to
do with simultaneous cross-node memory access. (*)

I also tried running `stressapptest` on node 0, but constraining it to only
allocate from node 1, so every memory access is cross-node. This also did not
reproduce the extreme latency spikes. Finally, I tried running a test designed
to stress the page-walking hardware (https://github.com/graphitemaster/xcpumemperf);
this did not appear to reproduce the issue either, despite ostensibly doing
something similar to `stress` and `stressapptest`.
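
For reference, the node-constrained runs described in the two paragraphs above
can be expressed with `numactl` roughly like this (a sketch; the first two
commands load each node independently, the last one forces cross-node traffic,
and node numbers are assumed to match `numactl --hardware`):

$ numactl --cpunodebind=0 --membind=0 stressapptest -M 256 -m 16 -s 60 &   # node 0 only
$ numactl --cpunodebind=1 --membind=1 stressapptest -M 256 -m 16 -s 60 &   # node 1 only
$ numactl --cpunodebind=0 --membind=1 stressapptest -M 256 -m 16 -s 60     # run on node 0, allocate from node 1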

In an effort to track down the cause, I tried running the ftrace mechanism's
`hwlat` and `preempt` tracers while reproducing the issue, to see if there
were any unusually high hardware latencies or slow kernel functions being run,
but I observed nothing above 10μs. (I should also point out that my kernel is
tainted by `zfs.ko`, but unloading it did not appear to make any difference,
nor did I see any slow kernel functions that could be related to it. Finally,
the problem does not appear to be I/O-related in general.)
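
For reference, this is roughly how those tracers can be driven through tracefs,
assuming the kernel was built with them and tracefs is mounted at
/sys/kernel/tracing (a sketch; the preemption tracer is named `preemptoff`
there):

# echo hwlat > /sys/kernel/tracing/current_tracer        # hardware latency detector
# echo 10 > /sys/kernel/tracing/tracing_thresh           # report gaps above 10 us
# cat /sys/kernel/tracing/trace

# echo preemptoff > /sys/kernel/tracing/current_tracer   # longest preempt-disabled section
# cat /sys/kernel/tracing/tracing_max_latency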

(*) This paragraph should be taken with a grain of salt, since I did not
repeat these tests often enough to be confident in the results. I will repeat
them when I get the time, if need be.

Notes on the symptoms
=====================

This may or may not be relevant, but one thing I noticed is that the latency
spikes seem to happen at around the same time, and usually only after a short
delay. For example, when running `hackbench`, the latencies will be absolutely
fine for 1-2 minutes, until they suddenly all start skyrocketing at the same
time. It's as if some component common to all of the CPU cores involved in the
test is triggering the latency spikes more or less simultaneously.

The extent and presence of the symptoms can also change over time as multiple
tests are performed, indicating that there might be some element of
statefulness to the problem's cause (memory fragmentation? something to do
with page table entries? ...). I tried playing around with
/proc/sys/vm/compact_memory, but the results were inconclusive.

Another very obvious symptom is that, in addition to the high measured
latencies, the system becomes noticeably slow to respond to any user
input while the problem is being reproduced. After switching focus from one
terminal to another it can take a good second to re-paint the contents. This
is especially noticeable while `stressapptest` is running. Under the scenarios
described above that do *not* reproduce the issue (but still put the system
under high load), I do not observe this effect at all.

Things I've tested
==================

I've gone through a long list of ideas while trying to both identify the cause
of this issue and find a practical workaround for it:

- Booting from a LiveCD (different kernel, no I/O)
- Using an RT-patched kernel (this reduced latencies on an idle system, but
  had no effect under load)
- Disabling various BIOS options related to the CPU, including:
  * Frequency scaling
  * Adaptive Overclocking
  * C-States
  * Cool & Quiet
  * SMT
  * Possibly others I have forgotten
- Disabling the IOMMU
- Modifying the RAM speeds (going from my normal OC of 2800 MHz down to the stock 2400 MHz)
- Setting every option containing the word "disable" in the AMD CBS menu
- Disabling various features related to the memory subsystem (encryption, scrubbing, ...)
- Booting with `nopti nospectre_v2 nospec_store_bypass_disable`
- Updating my BIOS from version F2b to version F10, which includes an update
  for the AGESA (no change related to this issue, but I did run into #196683
  after doing so, so I went back to F2b for subsequent tests)
- Enabling/Forcing/Disabling transparent hugepages
- Enabling CONFIG_RCU_NOCB_CPU and CONFIG_RCU_BOOST
- Changing the IRQ affinity mask to exclude a core (or range of cores) from
  IRQ handling while performing the test (see the sketch below)
- Switching between the SLAB and SLUB allocators

None of the above settings made any noticeable difference concerning this issue.
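
For reference, the IRQ-affinity experiment mentioned in the list above was done
roughly along these lines (a sketch; the mask excludes CPU 7 on a 32-CPU
machine, the IRQ number is just an example, and newly registered IRQs only
honor the default mask):

# echo ffffff7f > /proc/irq/default_smp_affinity   # default mask for new IRQs
# echo 0-6,8-31 > /proc/irq/42/smp_affinity_list   # per-IRQ mask for an existing IRQ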

The one setting I found that did make a difference is setting "Memory
Interleaving" to either "Die" or "Socket", rather than "Channel", which is what
I normally use. This effectively disables NUMA and makes the system present
itself as a UMA system. (I haven't tested this, but I suspect compiling Linux
with CONFIG_NUMA=n might have a similar effect.) After doing so, the latencies
I noted down were on the order of 1-5ms, more or less in line with the results
we're getting from Ryzen desktop systems.
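
A quick way to verify whether the interleaving setting (or CONFIG_NUMA=n)
really collapsed the topology into a single node is to check what the kernel
reports (a sketch, assuming `numactl` and util-linux are installed):

$ numactl --hardware     # should list a single node in "Socket" mode
$ lscpu | grep -i numa   # same information from util-linux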

Additionally, I *think* that toggling `/proc/sys/kernel/numa_balancing` can
affect the results. In particular, setting it to `0` immediately before running
the test apparently avoided the extreme latency spikes; however, the effect did
not last forever. (Re-running the test later seemed to reproduce the same
latencies as usual.) I need to investigate this behavior more closely before I
feel confident making any statement on the ultimate effect of `numa_balancing`,
but it's at least a variable to play with / keep in mind.
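
For anyone who wants to experiment with this knob, it can be flipped at runtime
(a sketch; the change is not persistent across reboots unless added to
sysctl.conf):

# sysctl -w kernel.numa_balancing=0   # disable automatic NUMA balancing
# sysctl -w kernel.numa_balancing=1   # re-enable it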

Practical workaround
====================

For those of you seeking a quick and dirty solution, the only workaround that
helped in practice was setting the default CPU affinity to exclude a single
core (in my case, #7). I did this by modifying the following setting in
/etc/systemd/system.conf:

CPUAffinity=0-6,8-22,24-31

I then pinned all of my realtime-sensitive applications (particularly my audio
player, as well as `jackd` and `pulseaudio`, and various other threads running
with SCHED_FIFO priority) to this reserved core and its SMT sibling, either by
running them through `taskset --cpu-list 7,23` or by modifying the
corresponding systemd service file's `CPUAffinity` field, e.g. the following
entry in ~/.config/systemd/user/pulseaudio.service.d/override.conf:

[Service]
CPUAffinity=7,23

As long as nothing else ever runs on these CPU cores, the problem can be
mostly ignored in practice. It's still annoying when everything slows down
and/or my cursor starts stuttering under heavy load (e.g. compiling a big
project), but it's not as annoying as audio underruns.
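
To double-check that the reserved core really stays free, one can inspect the
inherited affinity and see what is currently scheduled on it (a sketch; PID 1's
mask is what systemd-spawned services inherit):

$ taskset -cp 1                                  # affinity inherited by services
$ ps -eo pid,psr,comm | awk '$2==7 || $2==23'    # whatever is currently on CPU 7/23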

Affected machines + more results
================================

Below are some more results I've gathered from friendly users online as well as
from other machines I have access to, to make sure this problem is not caused
by a freak hardware defect isolated to my PC or a kernel misconfiguration on my
part. Rather than quoting the results verbatim I've included links to them,
since the relevant excerpts are already given above; they mostly exist to show
that this is a real, observable, and non-isolated issue that (IMHO) needs to be
solved.

1950X+ECC workstation ("Machine A")
-----------------------------------

This is my personal machine, and most of my tests are run on it. I can test
whatever I want using it.

- Motherboard: GIGABYTE X399 DESIGNARE EX-CF (BIOS version F2b)
- CPU: AMD Ryzen Threadripper 1950X
- Memory: 64 GiB Kingston unbuffered ECC DDR4-2400, overclocked to 2800 MHz
- Kernel version: Linux 4.17.8-gentoo-pcireset SMP PREEMPT

I've attached the full kernel configuration of this machine.

Intel i3 workstation ("Machine B")
----------------------------------

This is my work computer, and I can also run whatever test I want on it.

- Motherboard: MSI Z170-A PRO
- CPU: Intel Core i3 6100
- Memory:  16 GiB DDR4 at 2200 MHz

I have two series of tests from this machine, one with `-generic` and the
other with `-lowlatency`:

- Kernel version: Linux 4.4.0-131-lowlatency SMP PREEMPT (Ubuntu 16.04 LTS)
- Results under no load: https://0x0.st/sWB7.txt
- 5 minutes `hackbench`: https://0x0.st/sWBF.txt
- 1 minute `stressapptest`: https://0x0.st/sWBC.txt
- ? minutes `stress`: https://0x0.st/s41H.txt

Alternatively:
- Kernel version: Linux 4.4.0-131-generic SMP (Ubuntu 16.04 LTS)
- ? minutes `stress`: https://0x0.st/s41i.txt

The kernel configurations are available in the Ubuntu repositories.


Old dual Xeon server ("Machine C")
----------------------------------

This is a machine I have access to but can't do invasive testing on, so I'm
sticking to the default `-generic` kernel it was booted into.

- Motherboard: Supermicro X7DWN+
- CPU: 2x Intel Xeon E5410s
- Memory: 32 GiB ECC DDR3 at 666 MHz
- Kernel version: Linux 4.4.0-130-generic SMP (Ubuntu 16.04 LTS)
- No load: https://0x0.st/sWBY.txt
- 10 minutes `hackbench`: https://0x0.st/sWB6.txt
- 1 minute `stressapptest`: https://0x0.st/sWBU.txt


IRC contact's Zen+ ("Machine D")
--------------------------------

This is a machine from user \x on ##amd, who was kind enough to provide Zen+
measurements, confirming that the newer generation of Ryzen CPUs also seems to
be affected.

- Motherboard: Gigabyte X470 AORUS GAMING 5 WIFI-CF
- CPU: AMD Ryzen 7 2700X
- Memory: 16 GiB Kingston DDR4-2400 running at 3200 MHz
- Kernel version: Linux 2700X 4.17.11-041711-generic #201807280505 SMP Sat Jul 28 09:08:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
- No load: https://0x0.st/sJB6.txt
- CPU load: https://0x0.st/sJBI.txt
- VM load (`stress`): https://0x0.st/sJBl.txt


IRC contact's 1950X server
--------------------------

This is a setup fairly similar to mine (i.e. "Machine A"), but it serves as a
good confirmation that I don't simply have bad hardware. Sadly, the paste with
the actual results seems to have expired, so I no longer have them; the numbers
were essentially identical to those from my own system.


IRC contact's Ryzen desktop
---------------------------

This is another test contributed by user Highdude on ##amd. I don't have the
exact machine specifications.

- CPU: Unknown first generation Ryzen desktop CPU (8-core), presumably 1800X
- Memory: Unknown amount of XMP DDR4 memory at 3466 MHz
- Several minutes of `hackbench`: https://0x0.st/sWmE.txt

External links
==============

For the record, some links to discussions/information related to this issue:
- https://old.reddit.com/r/Amd/comments/8wahuy/july_tech_support_megathread/e3aot8z/
- https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1040577-amd-threadripper-2990wx-linux-benchmarks-the-32-core-64-thread-beast?p=1040701#post1040701
Comment 1 Niklas Haas 2018-08-14 03:41:56 UTC
I can also provide you with some additional information regarding "Machine A":

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.17.8-gentoo-pcireset root=ZFS=yggdrasil/ROOT/gentoo ro dozfs initrd=/boot/intel-ucode.img pci=nommconf

The `pci=nommconf` option works around an issue where the AMD GPU otherwise spams dmesg with messages about PCIe link errors. (This is a common, widespread issue, and not related to a hardware defect on my particular unit.)

$ uname -a
Linux xor 4.17.8-gentoo-pcireset #6 SMP PREEMPT Fri Jul 27 04:03:29 CEST 2018 x86_64 AMD Ryzen Threadripper 1950X 16-Core Processor AuthenticAMD Linux

The `-pcireset` suffix in the version string refers to this patch, which I use to work around an issue with VGA passthrough: https://patchwork.kernel.org/patch/10181903/

$ lsmod
Module                  Size  Used by
wireguard             208896  0
ip6_udp_tunnel         16384  1 wireguard
udp_tunnel             16384  1 wireguard
af_packet              49152  4
snd_hda_codec_realtek   102400  1
snd_hda_codec_generic    81920  2 snd_hda_codec_realtek
snd_hda_intel          36864  4
snd_hda_codec         131072  3 snd_hda_codec_generic,snd_hda_intel,snd_hda_codec_realtek
snd_hda_core           81920  4 snd_hda_codec_generic,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
kvm_amd               106496  0
ccp                    86016  1 kvm_amd
sp5100_tco             16384  0
i2c_piix4              24576  0
k10temp                16384  0
xfs                  1351680  2
zfs                  3493888  27
zunicode              335872  1 zfs
zavl                   16384  1 zfs
icp                   270336  1 zfs
zcommon                65536  1 zfs
znvpair                61440  2 zfs,zcommon
spl                    94208  4 zfs,icp,znvpair,zcommon
amdgpu               1597440  47
chash                  16384  1 amdgpu
backlight              16384  1 amdgpu
gpu_sched              28672  1 amdgpu
ttm                   118784  1 amdgpu
vfio_pci               53248  0
vfio_virqfd            16384  1 vfio_pci
vfio_iommu_type1       28672  0
vfio                   32768  2 vfio_iommu_type1,vfio_pci

I also forgot to mention that I see nothing in `dmesg` while the issue is being reproduced. (Although I once witnessed an event where the CPU stalled for so long that the kernel presumably failed to answer the USB audio interface in time, which resulted in a bunch of errors from `xhci_hcd` and required me to unplug and replug the affected USB device before it would resume normal operation.)
Comment 2 Niklas Haas 2018-08-14 04:17:02 UTC
Created attachment 277839 [details]
/proc/cpuinfo - Machine A
Comment 3 Niklas Haas 2018-08-14 04:17:20 UTC
Created attachment 277841 [details]
/proc/meminfo - Machine A
Comment 4 Niklas Haas 2018-08-14 04:17:47 UTC
Created attachment 277843 [details]
/proc/iomem - Machine A
Comment 5 Niklas Haas 2018-08-14 04:18:11 UTC
Created attachment 277845 [details]
/proc/ioports - Machine A
Comment 6 Niklas Haas 2018-08-14 04:18:40 UTC
Created attachment 277847 [details]
/proc/zoneinfo - Machine A
Comment 7 Niklas Haas 2018-08-14 04:18:58 UTC
Created attachment 277849 [details]
/proc/buddyinfo - Machine A
Comment 8 Niklas Haas 2018-08-14 04:19:16 UTC
Created attachment 277851 [details]
lspci -vvv
Comment 9 Niklas Haas 2018-08-14 04:22:00 UTC
Created attachment 277853 [details]
/proc/vmstat - Machine A
Comment 10 Niklas Haas 2018-08-14 04:23:13 UTC
Created attachment 277855 [details]
/proc/softirqs - Machine A
Comment 11 Niklas Haas 2018-08-14 04:24:54 UTC
Created attachment 277857 [details]
dmidecode - Machine A
Comment 12 Niklas Haas 2018-08-19 03:11:06 UTC
There's another issue I noticed that I can only reproduce on the Threadripper platform, which may or may not be related, since it also appears to be connected to NUMA balancing.

Unfortunately I don't have a good minimal reproducer for it, but the bug I noticed is that when quitting `mpv` after it has buffered a lot of data into its internal demuxer cache, it can take a good minute or two for mpv to free() all of that memory afterwards, even if it only allocated around 500 MB. If I attach `perf` to mpv during this period, 99.9% of the time is spent inside page faults, even though this system has no swap or zram/zswap. It still happens even if I use mlockall() to force all current and future memory to remain paged in, and moving the binary plus all mapped libraries to a tmpfs makes no difference. Turning the demuxer cache off avoids the issue, so it is almost surely connected to the memory allocations mpv makes.
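
For reference, the `perf` sample mentioned above can be captured roughly like
this while mpv is stuck in its free() loop (a sketch; it assumes a perf build
matching the running kernel):

$ perf record -g -p $(pidof mpv) -- sleep 15   # sample mpv for ~15 s with call graphs
$ perf report --stdio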

I can work around the issue by using `numactl` to force memory allocations to come from only one node (e.g. `numactl -m0`). Using `numactl -l` or `numactl -N0` alone does not help. Combining the two helps, but does not appear to completely eliminate the issue (which I don't understand - shouldn't that be the same as `-m0`?). Most curiously, forcing `-N0 -m1` works just as well as forcing `-N0 -m0`. So it appears to be related primarily to memory being *allocated* on multiple nodes, rather than memory being *accessed* across nodes.

Most importantly, turning off NUMA balancing completely eliminates this issue. More specifically, I can observe the following cases:

1. Disable NUMA balancing
2. Start mpv and let it buffer
3. Quit mpv -> quits instantly

1. Enable NUMA balancing
2. Start mpv and let it buffer
3. Disable NUMA balancing
4. Quit mpv -> quits instantly

1. Enable NUMA balancing
2. Start mpv and let it buffer
3. Quit mpv -> quits after 30-60 seconds

So it only actually matters for the free() loop during program shutdown. Having NUMA balancing turned on during this loop causes it to take one or two orders of magnitude longer than it should, even though the loop is not making any allocations - just freeing. Perhaps some NUMA balancing heuristic in Linux is mistuned for this scenario?

I haven't noticed this issue affect any other program, so it must be something about the way mpv uses its memory. Perhaps it's connected to the fact that mpv spawns a lot of decoder threads that access this memory, and these threads can be spread across both NUMA nodes. (Typically, mpv would spawn 16 threads, so presumably the scheduler would try to keep 8 on each node, since that's how many physical cores there are per node.) Performance during playback is fine; it's only the free() loop during the quit stage that causes the intense slowdown. I tried writing my own program that just allocates a bunch of memory and then free()s it, but I could not reproduce the same issue.

I'm not sure whether I should open a separate ticket for this or not. It's possible there's a connected cause here, i.e. some sort of systematic deficiency in the Threadripper NUMA implementation and its interaction with the way Linux handles NUMA placement/balancing, so I figured having more information in the same place is better for finding the root cause.

---------------------------------------------------------------

In case you want to try reproducing it, this is the command line I used for my test cases, along with examples of the output. You can substitute another file, but to reproduce the bug consistently and quickly it's best to use a high-bitrate file that can be buffered freely (i.e. not a segmented format like an HLS stream) and is served by a server with little restriction on transfer speed. If your network connection is not fast enough for mpv to buffer ~500 MB of data within this one minute, you may need to increase the duration of the test. Trying to reproduce it with local files seems to be difficult for some reason, possibly because mpv either doesn't want to cache them or because it fills the cache too quickly.

# echo 0 > /proc/sys/kernel/numa_balancing
$ time mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
...
7.62s user 1.83s system 14% cpu 1:03.48 total

# echo 1 > /proc/sys/kernel/numa_balancing
$ time mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
...
7.54s user 70.45s system 59% cpu 2:11.54 total

# echo 1 > /proc/sys/kernel/numa_balancing
$ time numactl -m0 -- mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
...
8.06s user 1.82s system 15% cpu 1:03.95 total

As you can see, running with NUMA balancing enabled costs an extra ~70 seconds of system time, all of which is spent during late shutdown (after mpv has printed "Exiting..." and is freeing memory). The extra time varies sporadically, sometimes amounting to only 20-30 seconds and sometimes to as much as 2 minutes.

I've attached a `perf record` sample of around 10-20 seconds taken in the middle of the slow free() loop.
Comment 13 Niklas Haas 2018-08-19 03:11:52 UTC
Created attachment 277933 [details]
`perf record` sample taken during mpv free() loop
Comment 14 Niklas Haas 2018-08-30 17:04:13 UTC
I updated to 4.18.5 and disabled NUMA balancing while I was at it. The problem seems significantly improved, but is still pretty bad. It might be the upgrade to 4.18, but I suspect it was disabling NUMA balancing that improved it. Needs more testing to confirm.

It might very well be that there's nothing wrong with Linux's NUMA balancing algorithm, just that it interacts poorly with some bug in the threadripper architecture to produce these massive lag spikes.

On "normal" loads (e.g. heavy compilation) I only get 2-3ms latency spikes after upgrade to the NUMA-balancing-disabled 4.18 kernel. Still way too high for a PREEMPT system, but not as high as it used to be. `stressapptest` still kills the system, causing massive slowdown and lag spikes as high as ~50ms after a few seconds of activation. Which, granted, is better by a few orders of magnitude - but the problem isn't gone.
Comment 15 Niklas Haas 2018-10-24 06:29:34 UTC
The combination of AGESA 1.1.0.1a and Linux 4.18.16 seems to have fixed this issue, both for me and for others who were experiencing the same problem.

As such, this annoying and awful bug can finally be closed. Hooray! \o/
