Bug 200811
| Summary: | Very high scheduling latency on Ryzen systems | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | Niklas Haas (kernel) |
| Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | cJ-kernel, hadrien.lacour |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 4.17.8-gentoo-pcireset | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Attachments:

- /proc/config.gz - Machine A
- /proc/cpuinfo - Machine A
- /proc/meminfo - Machine A
- /proc/iomem - Machine A
- /proc/ioports - Machine A
- /proc/zoneinfo - Machine A
- /proc/buddyinfo - Machine A
- lspci -vvv
- /proc/vmstat - Machine A
- /proc/softirqs - Machine A
- dmidecode - Machine A
- `perf record` sample taken during mpv free() loop
Description
Niklas Haas
2018-08-14 03:36:04 UTC
I can also provide you with some additional information regarding "Machine A":

```
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.17.8-gentoo-pcireset root=ZFS=yggdrasil/ROOT/gentoo ro dozfs initrd=/boot/intel-ucode.img pci=nommconf
```

The `pci=nommconf` works around an issue where the AMD GPU otherwise spams dmesg with messages about PCIe link errors. (This is a common/widespread issue, not related to a hardware defect on my particular unit.)

```
$ uname -a
Linux xor 4.17.8-gentoo-pcireset #6 SMP PREEMPT Fri Jul 27 04:03:29 CEST 2018 x86_64 AMD Ryzen Threadripper 1950X 16-Core Processor AuthenticAMD Linux
```

The `-pcireset` suffix in the version string refers to this patch, which I use to work around an issue with VGA passthrough: https://patchwork.kernel.org/patch/10181903/

```
$ lsmod
Module                  Size  Used by
wireguard             208896  0
ip6_udp_tunnel         16384  1 wireguard
udp_tunnel             16384  1 wireguard
af_packet              49152  4
snd_hda_codec_realtek 102400  1
snd_hda_codec_generic  81920  2 snd_hda_codec_realtek
snd_hda_intel          36864  4
snd_hda_codec         131072  3 snd_hda_codec_generic,snd_hda_intel,snd_hda_codec_realtek
snd_hda_core           81920  4 snd_hda_codec_generic,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
kvm_amd               106496  0
ccp                    86016  1 kvm_amd
sp5100_tco             16384  0
i2c_piix4              24576  0
k10temp                16384  0
xfs                  1351680  2
zfs                  3493888  27
zunicode              335872  1 zfs
zavl                   16384  1 zfs
icp                   270336  1 zfs
zcommon                65536  1 zfs
znvpair                61440  2 zfs,zcommon
spl                    94208  4 zfs,icp,znvpair,zcommon
amdgpu               1597440  47
chash                  16384  1 amdgpu
backlight              16384  1 amdgpu
gpu_sched              28672  1 amdgpu
ttm                   118784  1 amdgpu
vfio_pci               53248  0
vfio_virqfd            16384  1 vfio_pci
vfio_iommu_type1       28672  0
vfio                   32768  2 vfio_iommu_type1,vfio_pci
```

I also forgot to mention that I see nothing in `dmesg` while the issue is being reproduced. (Although I once witnessed an event where the CPU stalled for so long that the kernel presumably didn't manage to answer the USB audio interface in time, which resulted in a bunch of errors from `xhci_hcd` and required me to unplug and replug the affected USB device before it would resume normal operation.)

Created attachment 277839: /proc/cpuinfo - Machine A
Created attachment 277841: /proc/meminfo - Machine A
Created attachment 277843: /proc/iomem - Machine A
Created attachment 277845: /proc/ioports - Machine A
Created attachment 277847: /proc/zoneinfo - Machine A
Created attachment 277849: /proc/buddyinfo - Machine A
Created attachment 277851: lspci -vvv
Created attachment 277853: /proc/vmstat - Machine A
Created attachment 277855: /proc/softirqs - Machine A
Created attachment 277857: dmidecode - Machine A
There's another issue I noticed that I can only reproduce on the Threadripper platform, which may or may not be related, since it also appears to be connected to NUMA balancing. Unfortunately I don't have a good minimal reproducer for it, but the bug I noticed is that when quitting `mpv` after it has buffered a lot of data into its internal demuxer cache, it can take a good minute or two for mpv to free() all of that memory, even though it only allocated around 500 MB.

If I attach `perf` to mpv during this time period, 99.9% of the time is spent inside page faults, even though this system has no swap or zram/zswap. It still happens even if I use mlockall() to force all current and future memory to remain paged in, and moving the binary plus all mapped libraries to a tmpfs makes no difference. Turning the demuxer cache off avoids the issue, so in summary it is almost surely connected to the memory allocations mpv makes.

I can work around the issue by using `numactl` to force memory allocations to come from only one node (e.g. `numactl -m0`). Using `numactl -l` or `numactl -N0` alone does not help. Combining the two helps, but does not appear to completely eliminate the issue (which I don't understand - shouldn't that be the same as `-m0`?). Most curiously, forcing `-N0 -m1` works just as well as forcing `-N0 -m0`. So it appears to be related primarily to there being memory *allocated* on multiple nodes, rather than to the memory being *accessed* across nodes.

Most importantly, turning off NUMA balancing completely eliminates the issue. More specifically, I can observe the following cases:

1. Disable NUMA balancing
2. Start mpv and let it buffer
3. Quit mpv -> quits instantly

1. Enable NUMA balancing
2. Start mpv and let it buffer
3. Disable NUMA balancing
4. Quit mpv -> quits instantly

1. Enable NUMA balancing
2. Start mpv and let it buffer
3. Quit mpv -> quits after 30-60 seconds

So NUMA balancing only actually matters for the free() loop during program shutdown: having it enabled during this loop makes the loop take an order of magnitude or two longer than it should, even though the loop is not making any allocations - just freeing. It's possible that some NUMA balancing heuristic in Linux is undertuned for this scenario?

I haven't noticed this issue affect any other program, so it must be something about the way mpv uses its memory. Perhaps it's connected to the fact that mpv spawns a lot of decoder threads that access this memory, and these decoder threads can be spread across both NUMA nodes. (Typically, mpv would spawn 16 threads, so presumably the scheduler would try to keep 8 on each node, since that's how many physical cores there are per node.)

The performance during normal use is fine; it's only the free() loop during the quit stage that causes the intense slowdown. I tried writing my own program that just allocates a bunch of memory and then free()s it, but I could not reproduce the issue that way. (A guess at a closer reproducer is sketched after the reproduction commands below.)

I'm not sure if I should open a separate ticket for this or not. It's possible there's a connected cause here somehow, i.e. some sort of systematic deficiency in the Threadripper NUMA implementation and its interaction with the way Linux handles NUMA placement and balancing, so I figured having more information in the same place is better for finding the root cause.

---------------------------------------------------------------

In case you want to try reproducing it, this is the command line I used for my test cases, along with examples of the output.
You could replace the file with another one, but to reproduce the bug consistently and quickly it's best to use a file with a high bitrate, one that can be buffered freely (i.e. no segmented files like HLS streams) and that is served by a server with little restriction on transfer speed. If your network connection is not fast enough for mpv to buffer ~500 MB of data within this one-minute window, you may need to adjust the duration of the test. Reproducing it with local files seems to be difficult for some reason, possibly because mpv either doesn't want to cache them or because it fills the memory too quickly.

```
# echo 0 > /proc/sys/kernel/numa_balancing
$ time mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
...
7.62s user 1.83s system 14% cpu 1:03.48 total

# echo 1 > /proc/sys/kernel/numa_balancing
$ time mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
...
7.54s user 70.45s system 59% cpu 2:11.54 total

# echo 1 > /proc/sys/kernel/numa_balancing
$ time numactl -m0 -- mpv --no-config --demuxer-seekable-cache=yes --demuxer-max-bytes=2000000000 --no-video --ao null -end 0:60 https://www.twitch.tv/videos/289176308
...
8.06s user 1.82s system 15% cpu 1:03.95 total
```

As you can see, running with NUMA balancing enabled costs ~70 seconds of extra system time, all of which is spent during late shutdown (after mpv has printed "Exiting..." and is freeing memory). The extra time varies sporadically, sometimes only 20-30 seconds and sometimes as long as 2 minutes.

I've attached a `perf record` sample of around 10-20 seconds taken in the middle of the slow free() loop.

Created attachment 277933: `perf record` sample taken during mpv free() loop
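Since a plain allocate-then-free program did not reproduce the problem, a closer reproducer would presumably also need the multi-threaded access pattern that mpv's decoder threads create. The following is an untested sketch along those lines; every parameter in it (block count, block size, thread count, settle time) is a guess chosen to roughly mimic a ~500 MB cache touched by 16 threads, not a value taken from the report:

```c
/* numa-free-repro.c - sketch of a possible reproducer for the slow
 * free() loop described above. Allocate ~512 MiB in malloc()ed blocks,
 * let 16 threads read all of it (so the scheduler can spread them across
 * both nodes and NUMA balancing can start issuing hinting faults), wait
 * for the balancing scanner, then time a pure free() loop.
 *
 * Build: gcc -O2 -pthread numa-free-repro.c -o numa-free-repro
 * Compare runs with /proc/sys/kernel/numa_balancing set to 0 and to 1.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NBLOCKS  2048              /* 2048 * 256 KiB = 512 MiB */
#define BLKSIZE  (256 * 1024)
#define NTHREADS 16                /* mpv spawns ~16 decoder threads */

static char *blocks[NBLOCKS];

/* Reader thread: touch one byte per page of every block, repeatedly. */
static void *toucher(void *arg)
{
    volatile char sink = 0;
    for (int pass = 0; pass < 50; pass++)
        for (int i = 0; i < NBLOCKS; i++)
            for (size_t off = 0; off < BLKSIZE; off += 4096)
                sink += blocks[i][off];
    (void)sink;
    return NULL;
}

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    pthread_t th[NTHREADS];

    for (int i = 0; i < NBLOCKS; i++) {
        blocks[i] = malloc(BLKSIZE);
        memset(blocks[i], i, BLKSIZE);      /* fault every page in */
    }

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, toucher, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);

    sleep(10);      /* give the NUMA balancing scanner time to run */

    double t0 = now();
    for (int i = 0; i < NBLOCKS; i++)
        free(blocks[i]);                    /* the loop that stalls */
    printf("free() loop took %.3f s\n", now() - t0);
    return 0;
}
```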
I updated to 4.18.5 and disabled NUMA balancing while I was at it. The problem seems to be significantly improved, but it's still pretty bad. It might be the upgrade to 4.18, but I suspect it was disabling NUMA balancing that improved it. Needs more testing to confirm.

It might very well be that there's nothing wrong with Linux's NUMA balancing algorithm, and that it just interacts poorly with some bug in the Threadripper architecture to produce these massive lag spikes.

On "normal" loads (e.g. heavy compilation) I only get 2-3 ms latency spikes after upgrading to the NUMA-balancing-disabled 4.18 kernel. Still way too high for a PREEMPT system, but not as high as it used to be. `stressapptest` still kills the system, causing massive slowdown and lag spikes as high as ~50 ms after a few seconds of activity. Which, granted, is better by a few orders of magnitude - but the problem isn't gone.

The combination of AGESA 1.1.0.1a and Linux 4.18.16 seems to have fixed this issue, both for me and for others experiencing the same. As such, this annoying and awful bug can finally be closed. Hooray! \o/
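For anyone wanting to check their own machine against figures like the 2-3 ms and ~50 ms spikes quoted above, here is a minimal cyclictest-style wakeup-latency probe. It is purely illustrative; the report does not say which tool produced those numbers. Run it alongside a load such as `stressapptest` or a parallel compile and watch the worst-case figure:

```c
/* wakeup-latency.c - minimal scheduling-latency probe. Sleep on an
 * absolute 1 ms period with clock_nanosleep() and record how far past
 * each deadline the thread actually wakes up.
 *
 * Build: gcc -O2 wakeup-latency.c -o wakeup-latency
 */
#include <stdio.h>
#include <time.h>

#define PERIOD_NS  1000000L     /* 1 ms tick */
#define ITERATIONS 10000

static long ns_diff(struct timespec a, struct timespec b)
{
    return (a.tv_sec - b.tv_sec) * 1000000000L + (a.tv_nsec - b.tv_nsec);
}

int main(void)
{
    struct timespec next, now;
    long worst = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < ITERATIONS; i++) {
        /* advance the absolute deadline by one period */
        next.tv_nsec += PERIOD_NS;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);
        long late = ns_diff(now, next);     /* wakeup delay past deadline */
        if (late > worst)
            worst = late;
    }
    printf("worst wakeup latency: %.3f ms\n", worst / 1e6);
    return 0;
}
```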