Created attachment 277837 [details]
/proc/config.gz - Machine A

As an early disclaimer: this is a bug that may or may not be hardware-related and/or kernel-related. But in lieu of a way to directly contact AMD engineers, and in the hope that the kernel might be able to do *something* to improve the situation, as well as to gather information and gain exposure, I'm submitting it here.

Bug description
===============

On Ryzen systems, scheduling latency seems abnormally high when the system is under memory pressure, with the Threadripper line of CPUs being particularly affected. By "high" I mean values of >10ms, even hitting 100ms-1000ms under very heavy load.

Reproduction steps
==================

1. Run `cyclictest --numa -p90 -m`, or any other low-latency application (e.g. JACK, PulseAudio, ...). `cyclictest` is part of the `rt-tests` repository (hosted on git.kernel.org).

2. Put the system under heavy memory load. The best way I have found to reproduce the issue is `stressapptest` (https://github.com/stressapptest/stressapptest), e.g. with `stressapptest -M 256 -m 32 -s 60`. An alternative is `stress`, e.g. `stress --vm 32 --vm-bytes 256M`, or even the `hackbench` included in `rt-tests`. Alternatively, compile a large C project. (This is how I originally discovered the issue.)

   In the tests below I have used different methods of generating load on different machines, but the results are comparable no matter which method I use. The only difference I noticed between the methods is how soon and how frequently the latency spikes occur.

3. Observe the 'maximum latency' column of `cyclictest`'s output.

Actual result
=============

The observed latencies are extremely high, even on a kernel built with CONFIG_PREEMPT=y. They go well into the 100ms range, and can even hit 1000ms.
For example, an output may look like this (after running `stressapptest` for as little as 10 seconds):

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 9.28 6.91 4.36 1/1483 5765

T: 0 (28043) P:98 I:1000 C: 149780 Min: 2 Act: 3 Avg: 26 Max: 68807
T: 1 (28044) P:98 I:1500 C: 99853 Min: 2 Act: 4 Avg: 38 Max: 68926
T: 2 (28045) P:98 I:2000 C: 74890 Min: 2 Act: 6 Avg: 49 Max: 68245
T: 3 (28046) P:98 I:2500 C: 59887 Min: 2 Act: 7 Avg: 40 Max: 62970
T: 4 (28047) P:98 I:3000 C: 49874 Min: 2 Act: 8 Avg: 39 Max: 105782
T: 5 (28048) P:98 I:3500 C: 42776 Min: 2 Act: 5 Avg: 37 Max: 68180
T: 6 (28049) P:98 I:4000 C: 37428 Min: 2 Act: 4 Avg: 33 Max: 75606
T: 7 (28050) P:98 I:4500 C: 33299 Min: 2 Act: 4 Avg: 30 Max: 1875
T: 8 (28051) P:98 I:5000 C: 29967 Min: 2 Act: 6 Avg: 47 Max: 7764
T: 9 (28052) P:98 I:5500 C: 27245 Min: 2 Act: 7 Avg: 44 Max: 5314
T:10 (28053) P:98 I:6000 C: 24974 Min: 3 Act: 4 Avg: 49 Max: 740
T:11 (28054) P:98 I:6500 C: 23053 Min: 2 Act: 7 Avg: 53 Max: 434
T:12 (28055) P:98 I:7000 C: 21406 Min: 3 Act: 4 Avg: 53 Max: 2973
T:13 (28056) P:98 I:7500 C: 19979 Min: 3 Act: 3 Avg: 56 Max: 4425
T:14 (28057) P:98 I:8000 C: 18730 Min: 2 Act: 8 Avg: 57 Max: 596
T:15 (28058) P:98 I:8500 C: 17628 Min: 2 Act: 6 Avg: 70 Max: 5369
T:16 (28059) P:98 I:9000 C: 16642 Min: 2 Act: 6 Avg: 44 Max: 63759
T:17 (28060) P:98 I:9500 C: 15767 Min: 2 Act: 7 Avg: 43 Max: 62685
T:18 (28061) P:98 I:10000 C: 14978 Min: 2 Act: 6 Avg: 31 Max: 64241
T:19 (28062) P:98 I:10500 C: 14265 Min: 3 Act: 4 Avg: 43 Max: 55385
T:20 (28063) P:98 I:11000 C: 13610 Min: 2 Act: 4 Avg: 41 Max: 58667
T:21 (28064) P:98 I:11500 C: 13026 Min: 2 Act: 6 Avg: 35 Max: 38604
T:22 (28065) P:98 I:12000 C: 12482 Min: 2 Act: 3 Avg: 24 Max: 65180
T:23 (28066) P:98 I:12500 C: 11987 Min: 3 Act: 4 Avg: 21 Max: 301
T:24 (28067) P:98 I:13000 C: 11526 Min: 3 Act: 7 Avg: 43 Max: 1734
T:25 (28068) P:98 I:13500 C: 11099 Min: 3 Act: 4 Avg: 61 Max: 3004
T:26 (28069) P:98 I:14000 C: 10703 Min: 3 Act: 4 Avg: 58 Max: 647
T:27 (28070) P:98 I:14500 C: 10334 Min: 2 Act: 5 Avg: 48 Max: 629
T:28 (28071) P:98 I:15000 C: 9989 Min: 3 Act: 5 Avg: 54 Max: 6032
T:29 (28072) P:98 I:15500 C: 9667 Min: 2 Act: 4 Avg: 51 Max: 3003
T:30 (28073) P:98 I:16000 C: 9365 Min: 3 Act: 7 Avg: 55 Max: 354
T:31 (28074) P:98 I:16500 C: 9081 Min: 3 Act: 7 Avg: 87 Max: 6409

This was run on my personal machine ("Machine A"), built around a 1950X. On a non-NUMA Ryzen system, the numbers are significantly lower, but still higher than they should be. For example, a typical Ryzen 2700X desktop system ("Machine D") looks something like this:

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 13.88 6.81 4.45 20/1356 7555

T: 0 ( 7394) P:90 I:1000 C: 62225 Min: 1 Act: 144 Avg: 192 Max: 18190
T: 1 ( 7395) P:90 I:1500 C: 42300 Min: 1 Act: 37 Avg: 253 Max: 17808
T: 2 ( 7396) P:90 I:2000 C: 31767 Min: 1 Act: 15 Avg: 342 Max: 17449
T: 3 ( 7397) P:90 I:2500 C: 25674 Min: 1 Act: 20 Avg: 406 Max: 17945
T: 4 ( 7398) P:90 I:3000 C: 21678 Min: 1 Act: 26 Avg: 452 Max: 14535
T: 5 ( 7399) P:90 I:3500 C: 18741 Min: 1 Act: 15 Avg: 507 Max: 13813
T: 6 ( 7400) P:90 I:4000 C: 16464 Min: 1 Act: 20 Avg: 580 Max: 15629
T: 7 ( 7401) P:90 I:4500 C: 14885 Min: 1 Act: 3364 Avg: 582 Max: 17282
T: 8 ( 7402) P:90 I:5000 C: 13492 Min: 1 Act: 5305 Avg: 630 Max: 15512
T: 9 ( 7403) P:90 I:5500 C: 12363 Min: 1 Act: 16 Avg: 665 Max: 16643
T:10 ( 7404) P:90 I:6000 C: 11478 Min: 1 Act: 24 Avg: 658 Max: 13628
T:11 ( 7405) P:90 I:6500 C: 10669 Min: 1 Act: 37 Avg: 721 Max: 14851
T:12 ( 7406) P:90 I:7000 C: 9992 Min: 2 Act: 27 Avg: 704 Max: 12343
T:13 ( 7407) P:90 I:7500 C: 9408 Min: 2 Act: 29 Avg: 732 Max: 13683
T:14 ( 7408) P:90 I:8000 C: 8853 Min: 2 Act: 22 Avg: 671 Max: 13566
T:15 ( 7409) P:90 I:8500 C: 8400 Min: 2 Act: 23 Avg: 715 Max: 13756

Expected result
===============

On various Intel systems I have tested, the latencies typically stay well below 10ms on a kernel built with CONFIG_PREEMPT_NONE=y.
An example result, from an Intel desktop i3 system ("Machine B") running Ubuntu 16.04 with a stock `-generic` kernel:

# ./cyclictest --smp -p90 -m
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 4.26 3.11 1.44 1/619 14414

T: 0 (14118) P:90 I:1000 C: 305345 Min: 1 Act: 1 Avg: 77 Max: 3212
T: 1 (14119) P:90 I:1500 C: 209376 Min: 1 Act: 2 Avg: 83 Max: 4062
T: 2 (14120) P:90 I:2000 C: 158013 Min: 1 Act: 1 Avg: 87 Max: 5174
T: 3 (14121) P:90 I:2500 C: 126484 Min: 1 Act: 2 Avg: 86 Max: 2814

The results are (obviously) much better on a kernel built with CONFIG_PREEMPT=y, such as these results from the same system using the Ubuntu `-lowlatency` kernel:

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 4.61 3.09 1.38 1/582 4642

T: 0 ( 4360) P:90 I:1000 C: 280040 Min: 1 Act: 2 Avg: 8 Max: 252
T: 1 ( 4361) P:90 I:1500 C: 186655 Min: 1 Act: 1 Avg: 9 Max: 708
T: 2 ( 4362) P:90 I:2000 C: 139990 Min: 1 Act: 2 Avg: 11 Max: 264
T: 3 ( 4363) P:90 I:2500 C: 111992 Min: 1 Act: 1 Avg: 10 Max: 260

These are latencies I'd consider "reasonable" for a budget desktop system under extreme load. These numbers were obtained by running `stress --vm 4 --vm-bytes 256M` at the same time as `cyclictest --smp -p90 -m`.
To give a comparison with a typical NUMA system, I have measured this relatively old dual Xeon (E5410) setup ("Machine C"), using a non-preemptible kernel (also running Ubuntu 16.04):

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 20.20 29.30 31.89 1/1704 10951

T: 0 ( 6624) P:90 I:1000 C: 76576 Min: 4 Act: 8 Avg: 12 Max: 1572
T: 1 ( 6625) P:90 I:1500 C: 51052 Min: 4 Act: 6 Avg: 14 Max: 1241
T: 2 ( 6626) P:90 I:2000 C: 38289 Min: 5 Act: 8 Avg: 14 Max: 1734
T: 3 ( 6628) P:90 I:2500 C: 30631 Min: 5 Act: 8 Avg: 17 Max: 1260
T: 4 ( 6629) P:90 I:3000 C: 25526 Min: 5 Act: 7 Avg: 19 Max: 2182
T: 5 ( 6631) P:90 I:3500 C: 21879 Min: 4 Act: 8 Avg: 23 Max: 685
T: 6 ( 6633) P:90 I:4000 C: 19144 Min: 5 Act: 7 Avg: 22 Max: 3776
T: 7 ( 6634) P:90 I:4500 C: 17017 Min: 5 Act: 8 Avg: 22 Max: 938

These numbers are high, but within the range of what I'd expect for a non-preemptible kernel. This result was obtained using `stress --vm 8 --vm-bytes 256M`.

Notes on reproduction / causes
==============================

Several factors determine both the magnitude and the presence of the latency spikes.

The first is that the machine has to be under heavy memory load in general. Loading a single CPU core is not enough, and does not appear to correlate with a measurable increase in latencies even on the loaded core. On a more-or-less idle system, my numbers are fine (latencies in the 1-20μs range, as I'd expect for a PREEMPT kernel). Doing purely computational work (e.g. data fits in L2) on the CPUs does not trigger the issue, even if the system is fully loaded this way. I/O load does not trigger the issue either, and the issue still occurs even with no block devices mounted.

The second is that the latencies only increase on the loaded cores.
For example, if I use CPU affinities to exclude a single core (#7 in this case) from running anything other than `cyclictest`, I get a result that looks like this after around 5 minutes of running `hackbench`:

# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 185.30 109.66 47.59 164/1953 19143

T: 0 ( 2678) P:98 I:1000 C: 260477 Min: 2 Act: 6 Avg: 7 Max: 244128
T: 1 ( 2679) P:98 I:1500 C: 173854 Min: 2 Act: 4 Avg: 7 Max: 91214
T: 2 ( 2680) P:98 I:2000 C: 130348 Min: 2 Act: 5 Avg: 9 Max: 80307
T: 3 ( 2681) P:98 I:2500 C: 104261 Min: 2 Act: 6 Avg: 10 Max: 82710
T: 4 ( 2682) P:98 I:3000 C: 86907 Min: 2 Act: 6 Avg: 10 Max: 68245
T: 5 ( 2683) P:98 I:3500 C: 74459 Min: 2 Act: 6 Avg: 13 Max: 129451
T: 6 ( 2684) P:98 I:4000 C: 65092 Min: 3 Act: 7 Avg: 18 Max: 133967
T: 7 ( 2685) P:98 I:4500 C: 58032 Min: 2 Act: 6 Avg: 7 Max: 372
T: 8 ( 2686) P:98 I:5000 C: 52165 Min: 2 Act: 8 Avg: 12 Max: 68300
T: 9 ( 2687) P:98 I:5500 C: 47470 Min: 2 Act: 6 Avg: 7 Max: 25042
T:10 ( 2688) P:98 I:6000 C: 43479 Min: 3 Act: 5 Avg: 12 Max: 90672
T:11 ( 2689) P:98 I:6500 C: 40166 Min: 3 Act: 4 Avg: 7 Max: 30386
T:12 ( 2690) P:98 I:7000 C: 37293 Min: 3 Act: 8 Avg: 8 Max: 50129
T:13 ( 2691) P:98 I:7500 C: 34799 Min: 3 Act: 6 Avg: 11 Max: 84115
T:14 ( 2692) P:98 I:8000 C: 32605 Min: 3 Act: 5 Avg: 16 Max: 1399177
T:15 ( 2693) P:98 I:8500 C: 30720 Min: 3 Act: 6 Avg: 8 Max: 22138
T:16 ( 2694) P:98 I:9000 C: 28998 Min: 2 Act: 8 Avg: 12 Max: 50514
T:17 ( 2695) P:98 I:9500 C: 27487 Min: 3 Act: 6 Avg: 7 Max: 12135
T:18 ( 2696) P:98 I:10000 C: 26100 Min: 3 Act: 5 Avg: 12 Max: 1194399
T:19 ( 2697) P:98 I:10500 C: 24863 Min: 3 Act: 5 Avg: 9 Max: 81541
T:20 ( 2698) P:98 I:11000 C: 23737 Min: 3 Act: 5 Avg: 8 Max: 41699
T:21 ( 2699) P:98 I:11500 C: 22706 Min: 3 Act: 5 Avg: 6 Max: 17104
T:22 ( 2700) P:98 I:12000 C: 21758 Min: 3 Act: 4 Avg: 8 Max: 44769
T:23 ( 2701) P:98 I:12500 C: 20891 Min: 2 Act: 8 Avg: 7 Max: 144
T:24 ( 2702) P:98 I:13000 C: 20087 Min: 3 Act: 4 Avg: 5 Max: 42
T:25 ( 2703) P:98 I:13500 C: 19343 Min: 3 Act: 5 Avg: 6 Max: 51
T:26 ( 2704) P:98 I:14000 C: 18652 Min: 3 Act: 5 Avg: 5 Max: 144
T:27 ( 2705) P:98 I:14500 C: 18009 Min: 3 Act: 6 Avg: 5 Max: 34
T:28 ( 2706) P:98 I:15000 C: 17409 Min: 3 Act: 5 Avg: 6 Max: 39
T:29 ( 2707) P:98 I:15500 C: 16847 Min: 3 Act: 9 Avg: 5 Max: 54
T:30 ( 2708) P:98 I:16000 C: 16321 Min: 3 Act: 6 Avg: 6 Max: 36
T:31 ( 2709) P:98 I:16500 C: 15826 Min: 3 Act: 5 Avg: 8 Max: 866

The extreme results on the Threadripper systems also appear to be linked to the system's NUMA architecture. In particular, if I constrain my load to a single NUMA node (e.g. #0), I get lower numbers, more in line with what you see from the desktop Ryzen CPUs. However, loading the nodes independently (e.g. with two instances of `stressapptest`, each constrained to a single node) also didn't appear to trigger the insanely high latencies; so I suspect it has something to do with simultaneous cross-node memory access. (*)

I also tried running `stressapptest` on node 0 but constraining it to only allocate from node 1, so that every memory access is cross-node. This also did not reproduce the extreme latency spikes.

Finally, I tried running a test designed to stress the page-walking hardware: https://github.com/graphitemaster/xcpumemperf — this did not appear to reproduce the issue, despite ostensibly doing a similar thing to `stress` and `stressapptest`.

In an effort to track down the cause, I tried running ftrace's `hwlat` and `preempt` tracers while reproducing the issue, to see whether there were any unusually high hardware latencies or slow kernel functions being run, but I observed nothing above 10μs. (I also must point out that my kernel is tainted by `zfs.ko`, but unloading it did not appear to make any difference, nor did I see any slow kernel functions that could be related to it. Finally, the problem does not appear to be I/O-related in general.)

(*) This paragraph needs to be taken with a grain of salt, since I did not repeat these tests often enough to be confident in the results. I will repeat them when I get the time, if need be.

Notes on the symptoms
=====================

This may or may not be relevant, but one thing I noticed is that the latency spikes seem to happen at around the same time, and usually only after a short delay. For example, when running `hackbench`, the latencies will be absolutely fine for 1-2 minutes, until they suddenly all start skyrocketing at the same time. It's as if some component common to all of the CPU cores involved in the test triggers the latency spikes more or less simultaneously.

The extent and presence of the symptoms can also change over time as multiple tests are performed, indicating that there might be some element of statefulness to the problem's cause (memory fragmentation? something to do with page table entries? ...). I tried playing around with /proc/sys/vm/compact_memory, but the results were inconclusive.

Another thing that's immediately obvious is that, in addition to the high measurement results, the system becomes noticeably slow to respond to any user input while the problem is being reproduced. After switching focus from one terminal to another, it can take a good second to re-paint the contents. This is especially noticeable while `stressapptest` is running. Under the scenarios described above that do *not* reproduce the issue (but still put the system under high load), I do not observe this effect at all.
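As an aside, for anyone skimming these dumps: this is the kind of throwaway snippet I use to pull the overall worst-case Max out of a cyclictest log. It's purely a convenience of my own (the `cyclictest_worst_max` name is made up; it is not part of `rt-tests`):

```shell
# cyclictest_worst_max: read cyclictest output on stdin and print the
# largest per-thread "Max:" value (i.e. the worst-case latency, in µs).
cyclictest_worst_max() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "Max:" && $(i + 1) + 0 > max) max = $(i + 1) + 0 } END { print max + 0 }'
}

# Example with two lines from the Machine A run above:
cyclictest_worst_max <<'EOF'
T: 0 (28043) P:98 I:1000 C: 149780 Min: 2 Act: 3 Avg: 26 Max: 68807
T: 4 (28047) P:98 I:3000 C: 49874 Min: 2 Act: 8 Avg: 39 Max: 105782
EOF
# prints 105782
```

In practice I just pipe a whole saved log into it, e.g. `cyclictest_worst_max < cyclictest.log`.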
Things I've tested
==================

I've gone through a long list of ideas in trying to both identify the cause of this issue and find a practical workaround for it:

- Booting from a LiveCD (different kernel, no I/O)
- Using an RT-patched kernel (this reduced latencies on an idle system, but had no effect under load)
- Disabling various BIOS options related to the CPU, including:
  * Frequency scaling
  * Adaptive overclocking
  * C-states
  * Cool'n'Quiet
  * SMT
  * Possibly others I have forgotten
- Disabling the IOMMU
- Modifying the RAM speeds (going from my normal OC of 2800 MHz down to the stock 2400 MHz)
- Setting everything that has the word "disable" in the option name inside the AMD CBS menu
- Disabling various features related to the memory subsystem (encryption, scrubbing, ...)
- Booting with `nopti nospectre_v2 nospec_store_bypass_disable`
- Updating my BIOS from version F2b to version F10, which includes an AGESA update (no change related to this issue, but I did run into bug #196683 after doing so, so I went back to F2b for subsequent tests)
- Enabling/forcing/disabling transparent hugepages
- Enabling CONFIG_RCU_NOCB_CPU and CONFIG_RCU_BOOST
- Changing the IRQ affinity mask to exclude a core (or range of cores) from IRQ handling while performing the test
- Switching between SLAB and SLUB

None of the above made any noticeable difference concerning this issue.

The one setting I found that did make a difference is setting "Memory Interleaving" to either "Die" or "Socket", rather than "Channel", which is what I normally use. This effectively disables NUMA and makes the system present itself as if it were a UMA system. (I haven't tested this, but I suspect compiling Linux with CONFIG_NUMA=n might have a similar effect.) After doing so, the latencies I noted down were on the order of 1-5ms, more or less in line with the results we're getting from Ryzen desktop systems.
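For the IRQ-affinity test mentioned above: the `smp_affinity` files under /proc/irq/ take hexadecimal CPU bitmaps, so I computed the masks with a small helper along these lines (my own sketch; the `exclude_cpus_mask` name is made up, and it assumes a 32-CPU system like the 1950X):

```shell
# exclude_cpus_mask: print a hex CPU bitmap covering all 32 CPUs except
# the ones passed as arguments, in the format accepted by the
# /proc/irq/*/smp_affinity files. Assumes a 32-CPU system.
exclude_cpus_mask() {
  local mask=$(( 0xFFFFFFFF ))
  local cpu
  for cpu in "$@"; do
    mask=$(( mask & ~(1 << cpu) ))
  done
  printf '%08x\n' "$mask"
}

exclude_cpus_mask 7 23   # exclude core #7 and its SMT sibling
# prints ff7fff7f
```

The resulting mask can then be written (as root) to e.g. /proc/irq/default_smp_affinity, or to an individual IRQ's smp_affinity file.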
Additionally, I *think* I found that playing around with `/proc/sys/kernel/numa_balancing` can affect the results. In particular, setting it to `0` immediately before running the test apparently avoided the extreme latency spikes; however, it did not stay that way forever. (Re-running the test later seemed to reproduce the same latencies as usual.) I need to investigate this behavior more closely before I feel confident making any statement about the ultimate effect of `numa_balancing`, but it's at least a variable to play with / keep in mind.

Practical workaround
====================

For those of you seeking a quick and dirty solution: the only workaround that helped in practice was setting the default CPU affinity to exclude a single core (in my case, #7). I did this by modifying the following setting in /etc/systemd/system.conf:

CPUAffinity=0-6,8-22,24-31

I then made sure to isolate all of my realtime-sensitive applications (particularly my audio player, as well as `jackd` and `pulseaudio`, and various other threads running with SCHED_FIFO priority) onto this isolated CPU core, either by running them through `taskset --cpu-list 7,23`, or by modifying the corresponding systemd service file's `CPUAffinity` field, e.g. the following entry in ~/.config/systemd/user/pulseaudio.service.d/override.conf:

[Service]
CPUAffinity=7,23

As long as nothing else ever runs on these CPU cores, the problem can be mostly ignored in practice. It's still annoying when everything slows down and/or my cursor starts stuttering under heavy load (e.g. compiling a big project), but it's not as annoying as audio underruns.

Affected machines + more results
================================

Below are some more results that I've managed to gather from friendly users online, as well as from other machines I have access to, in order to make sure this problem is not related to some freak hardware defect isolated to my PC, or to a kernel misconfiguration on my part.
I've omitted the results verbatim and included links to them instead, since the relevant excerpts are already given above; they mostly exist to prove that this is a real, observable, non-isolated issue that (IMHO) needs to be solved.

1950X+ECC workstation ("Machine A")
-----------------------------------

This is my personal machine, and most of my tests are run on it. I can test whatever I want using it.

- Motherboard: GIGABYTE X399 DESIGNARE EX-CF (BIOS version F2b)
- CPU: AMD Ryzen Threadripper 1950X
- Memory: 64 GiB Kingston unbuffered ECC DDR4-2400, overclocked to 2800 MHz
- Kernel version: Linux 4.17.8-gentoo-pcireset SMP PREEMPT

I've attached the full kernel configuration of this machine.

Intel i3 workstation ("Machine B")
----------------------------------

This is my work computer, and I can also run whatever test I want on it.

- Motherboard: MSI Z170-A PRO
- CPU: Intel Core i3-6100
- Memory: 16 GiB DDR4 at 2200 MHz

I have two series of tests from this machine, one with the `-lowlatency` kernel and the other with `-generic`:

- Kernel version: Linux 4.4.0-131-lowlatency SMP PREEMPT (Ubuntu 16.04 LTS)
- Results under no load: https://0x0.st/sWB7.txt
- 5 minutes of `hackbench`: https://0x0.st/sWBF.txt
- 1 minute of `stressapptest`: https://0x0.st/sWBC.txt
- ? minutes of `stress`: https://0x0.st/s41H.txt

Alternatively:

- Kernel version: Linux 4.4.0-131-generic SMP (Ubuntu 16.04 LTS)
- ? minutes of `stress`: https://0x0.st/s41i.txt

The kernel configurations are available in the Ubuntu repositories.

Old dual Xeon server ("Machine C")
----------------------------------

This is a machine I have access to but can't do invasive testing on, so I'm sticking to the default `-generic` kernel into which it was booted.
- Motherboard: Supermicro X7DWN+
- CPU: 2x Intel Xeon E5410
- Memory: 32 GiB ECC DDR3 at 666 MHz
- Kernel version: Linux 4.4.0-130-generic SMP (Ubuntu 16.04 LTS)

- No load: https://0x0.st/sWBY.txt
- 10 minutes of `hackbench`: https://0x0.st/sWB6.txt
- 1 minute of `stressapptest`: https://0x0.st/sWBU.txt

IRC contact's Zen+ ("Machine D")
--------------------------------

This is a machine from user \x on ##amd, who was friendly enough to help provide Zen+ measurements, confirming that the new generation of Ryzen CPUs also seems to be affected.

- Motherboard: Gigabyte X470 AORUS GAMING 5 WIFI-CF
- CPU: AMD Ryzen 7 2700X
- Memory: 16 GiB Kingston DDR4-2400 running at 3200 MHz
- Kernel version: Linux 2700X 4.17.11-041711-generic #201807280505 SMP Sat Jul 28 09:08:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

- No load: https://0x0.st/sJB6.txt
- CPU load: https://0x0.st/sJBI.txt
- VM load (`stress`): https://0x0.st/sJBl.txt

IRC contact's 1950X server
--------------------------

This is a setup fairly similar to mine (i.e. "Machine A"), but it's a good confirmation that I don't simply have bad hardware. Sadly, the paste with the actual results seems to have expired, so I don't have them; but the numbers were essentially identical to those from my own system.

IRC contact's Ryzen desktop
---------------------------

This is another test contributed by user Highdude on ##amd. I don't have the exact machine specifications.

- CPU: Unknown first-generation 8-core Ryzen desktop CPU, presumably an 1800X
- Memory: Unknown amount of XMP DDR4 memory at 3466 MHz

- Several minutes of `hackbench`: https://0x0.st/sWmE.txt

External links
==============

For the record, some links to discussions/information related to this issue:

- https://old.reddit.com/r/Amd/comments/8wahuy/july_tech_support_megathread/e3aot8z/
- https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/1040577-amd-threadripper-2990wx-linux-benchmarks-the-32-core-64-thread-beast?p=1040701#post1040701
I can also provide you with some additional information regarding "Machine A":

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.17.8-gentoo-pcireset root=ZFS=yggdrasil/ROOT/gentoo ro dozfs initrd=/boot/intel-ucode.img pci=nommconf

The `pci=nommconf` works around an issue where the AMD GPU otherwise spams dmesg with messages about PCIe link errors. (This is a common/widespread issue, and not related to a hardware defect on my particular unit.)

$ uname -a
Linux xor 4.17.8-gentoo-pcireset #6 SMP PREEMPT Fri Jul 27 04:03:29 CEST 2018 x86_64 AMD Ryzen Threadripper 1950X 16-Core Processor AuthenticAMD Linux

The `-pcireset` suffix in the version string refers to this patch, which I use to work around an issue with VGA passthrough: https://patchwork.kernel.org/patch/10181903/

$ lsmod
Module                  Size  Used by
wireguard             208896  0
ip6_udp_tunnel         16384  1 wireguard
udp_tunnel             16384  1 wireguard
af_packet              49152  4
snd_hda_codec_realtek 102400  1
snd_hda_codec_generic  81920  2 snd_hda_codec_realtek
snd_hda_intel          36864  4
snd_hda_codec         131072  3 snd_hda_codec_generic,snd_hda_intel,snd_hda_codec_realtek
snd_hda_core           81920  4 snd_hda_codec_generic,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
kvm_amd               106496  0
ccp                    86016  1 kvm_amd
sp5100_tco             16384  0
i2c_piix4              24576  0
k10temp                16384  0
xfs                  1351680  2
zfs                  3493888  27
zunicode              335872  1 zfs
zavl                   16384  1 zfs
icp                   270336  1 zfs
zcommon                65536  1 zfs
znvpair                61440  2 zfs,zcommon
spl                    94208  4 zfs,icp,znvpair,zcommon
amdgpu               1597440  47
chash                  16384  1 amdgpu
backlight              16384  1 amdgpu
gpu_sched              28672  1 amdgpu
ttm                   118784  1 amdgpu
vfio_pci               53248  0
vfio_virqfd            16384  1 vfio_pci
vfio_iommu_type1       28672  0
vfio                   32768  2 vfio_iommu_type1,vfio_pci

I also forgot to mention that I see nothing in `dmesg` while the issue is being reproduced.
(Although I once witnessed an event where the CPU stalled for so long that the kernel presumably didn't manage to answer the USB audio interface in time, which resulted in a bunch of errors by `xhci_hcd` and required me to unplug and replug the affected USB device before it would resume normal operation)
Created attachment 277839 [details] /proc/cpuinfo - Machine A
Created attachment 277841 [details] /proc/meminfo - Machine A
Created attachment 277843 [details] /proc/iomem - Machine A
Created attachment 277845 [details] /proc/ioports - Machine A
Created attachment 277847 [details] /proc/zoneinfo - Machine A
Created attachment 277849 [details] /proc/buddyinfo - Machine A
Created attachment 277851 [details] lspci -vvv
Created attachment 277853 [details] /proc/vmstat - Machine A
Created attachment 277855 [details] /proc/softirqs - Machine A
Created attachment 277857 [details] dmidecode - Machine A
There's another issue I noticed that I can only reproduce on the Threadripper platform. It may or may not be related, since it also appears to be connected to NUMA balancing. Unfortunately I don't have a good minimal reproducer for it.

The bug is that when quitting `mpv` after it has buffered a lot of data into its internal demuxer cache, it can take a good minute or two for mpv to free() all of that memory afterwards, even if it only allocated around 500 MB. If I attach `perf` to mpv during this time period, 99.9% of the time is spent inside page faults, even though this system has no swap or zram/zswap. It still happens even if I use mlockall() to force all current and future memory to remain paged in, and moving the binary plus all mapped libraries to a tmpfs makes no difference. Turning the demuxer cache off avoids the issue, so in summary it's almost surely connected to the memory allocations mpv makes.

I can work around the issue by using `numactl` to force memory allocations to come from only one node (e.g. `numactl -m0`). Using `numactl -l` or `numactl -N0` alone does not help. Combining the two helps, but does not appear to completely eliminate the issue (which I don't understand - shouldn't that be the same as `-m0`?). Most curiously, forcing `-N0 -m1` works just as well as forcing `-N0 -m0`. So it appears to be related primarily to memory being *allocated* on multiple nodes, rather than memory being *accessed* across nodes.

Most importantly, turning off NUMA balancing completely eliminates this issue. More specifically, I can observe the following cases:

1. Disable NUMA balancing
2. Start mpv and let it buffer
3. Quit mpv
-> quits instantly

1. Enable NUMA balancing
2. Start mpv and let it buffer
3. Disable NUMA balancing
4. Quit mpv
-> quits instantly

1. Enable NUMA balancing
2. Start mpv and let it buffer
3. Quit mpv
-> quits after 30-60 seconds

So NUMA balancing only actually matters for the free() loop during program shutdown. NUMA balancing being turned on during this loop causes it to take an order of magnitude or two longer than it should, even though this loop is not making any allocations - just freeing. It's possible that some NUMA balancing heuristic in Linux is undertuned for this scenario?

I haven't noticed this issue affect any other program, so it must be something about the way mpv uses its memory. Perhaps it's connected to the fact that mpv spawns a lot of decoder threads that access this memory, and these decoder threads can be spread across both NUMA nodes. (Typically, mpv would spawn 16 threads, so presumably the scheduler would try to keep 8 on each node, since that's how many physical cores there are per node.) Performance during usage is fine; it's only the free() loop during the quit stage that causes the intense slowdown.

I tried writing my own program that just allocates a bunch of memory and then free()s it, but I could not reproduce the same issue.

I'm not sure if I should open a separate ticket for this or not. It's possible there's a connected cause here somehow, i.e. some sort of systematic deficiency in the Threadripper NUMA implementation and its interaction with the way Linux handles NUMA placement / balancing; so I figured having more information in the same place is better for finding the root cause.

---------------------------------------------------------------

In case you want to try reproducing it, this is the command line I used for my test cases, along with examples of the output. You can replace the file with another one, but to reproduce the bug consistently and quickly it's best to use a file with a high bitrate, which can be buffered freely (i.e. no segmented formats like HLS streams) and is served by a server with little restriction on transfer speed.
If your network connection is not fast enough to cause mpv to buffer ~500 MB of data within this minute, you may need to adjust the duration of the test. Trying to reproduce the bug with local files seems to be difficult for some reason, possibly because mpv either doesn't want to cache them, or because it fills the memory too quickly.

# echo 0 > /proc/sys/kernel/numa_balancing
$ time mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
... 7.62s user 1.83s system 14% cpu 1:03.48 total

# echo 1 > /proc/sys/kernel/numa_balancing
$ time mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
... 7.54s user 70.45s system 59% cpu 2:11.54 total

# echo 1 > /proc/sys/kernel/numa_balancing
$ time numactl -m0 -- mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
... 8.06s user 1.82s system 15% cpu 1:03.95 total

As you can see, running with NUMA balancing enabled makes the run take ~70 seconds of extra system time, all of which is spent during late shutdown (after mpv has printed "Exiting..." and is freeing memory). The extra time varies sporadically, sometimes only 20-30 seconds and sometimes as much as 2 minutes.

I've attached a `perf record` sample of around 10-20 seconds taken in the middle of the slow free() loop.
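The toggling above can be wrapped in a small helper for repeated A/B runs. This is just a convenience sketch of my own (the `run_with_balancing` name and the `NUMA_BALANCING_FILE` override are made up; the override only exists so the function can be exercised without root - by default it writes the real sysctl):

```shell
# run_with_balancing <0|1> <command...>: set NUMA balancing on or off,
# then run the given command under `time`. Writing the real sysctl
# requires root; set NUMA_BALANCING_FILE to another path for a dry run.
run_with_balancing() {
  echo "$1" > "${NUMA_BALANCING_FILE:-/proc/sys/kernel/numa_balancing}"
  shift
  time "$@"
}

# Intended usage (as root), mirroring the comparison above:
#   run_with_balancing 0 mpv --no-config --demuxer-seekable-cache=yes ...
#   run_with_balancing 1 mpv --no-config --demuxer-seekable-cache=yes ...
```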
Created attachment 277933 [details] `perf record` sample taken during mpv free() loop
I updated to 4.18.5 and disabled NUMA balancing while I was at it. The problem seems to be significantly improved, but is still pretty bad. It might be the upgrade to 4.18, but I suspect it was disabling NUMA balancing that improved things; more testing is needed to confirm. It might very well be that there's nothing wrong with Linux's NUMA balancing algorithm, and that it merely interacts poorly with some bug in the Threadripper architecture to produce these massive lag spikes.

On "normal" loads (e.g. heavy compilation) I now only get 2-3ms latency spikes on the NUMA-balancing-disabled 4.18 kernel. Still way too high for a PREEMPT system, but not as high as it used to be. `stressapptest` still kills the system, causing massive slowdown and lag spikes as high as ~50ms within a few seconds of activation. Which, granted, is better by a few orders of magnitude - but the problem isn't gone.
The combination of AGESA 1.1.0.1a and Linux 4.18.16 seems to have fixed this issue, both for me and for others experiencing the same. As such, this annoying and awful bug can finally be closed. Hooray! \o/