Bug 219009
| Summary: | Random host reboots on Ryzen 7000/8000 using nested VMs (vls suspected) | | |
|---|---|---|---|
| Product: | Virtualization | Reporter: | Žilvinas Žaltiena (zaltys) |
| Component: | kvm | Assignee: | virtualization_kvm |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | high | CC: | animusnull, blake, kernel, mario.limonciello, michal.litwinczuk, mlevitsk, ozonehelix, sagnik, simon |
| Priority: | P3 | | |
| Hardware: | AMD | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | attachment-12508-0.html, attachment-622-0.html | | |
Description
Žilvinas Žaltiena
2024-07-06 11:20:39 UTC
To all Zen4 users! I'm now experimenting with different settings and found a way to enable nested virtualization with the host CPU type and no performance penalty. Disabling ASPM in the BIOS does fix it for a Win11 guest (Proxmox, Hyper-V enabled). Further testing is required, so anyone having this issue can contribute by testing other guest operating systems. For now it's the best solution until a fix from AMD arrives. Unless it's irreparable with just a microcode update...

I can confirm on my Ryzen 9 7900X30 system disabling ASPM helps. Kind of relieved this is a bug. I was having this issue on my Ryzen 7 5700G system, but to a lesser extent, and it got worse when I upgraded to the Ryzen 7000 series.

(In reply to Ben Hirlston from comment #2)
> I can confirm on my Ryzen 9 7900X30 system disabling aspm helps. kind of
> relieved this is a bug. I was having this issue on my Ryzen 7 5700G system
> but to a lesser extent and it got worse when I upgraded to Ryzen 7000 series

There is a typo, I meant 7900X3D. Sorry about that.

(In reply to Ben Hirlston from comment #3)
> (In reply to Ben Hirlston from comment #2)
> > I can confirm on my Ryzen 9 7900X30 system disabling aspm helps. kind of
> > relieved this is a bug. I was having this issue on my Ryzen 7 5700G system
> > but to a lesser extent and it got worse when I upgraded to Ryzen 7000
> > series
>
> there is a typo I meant 7900X3D sorry about that

No, I had to set kvm_amd.vls=0. The hit to performance wasn't as bad as I thought, but disabling ASPM didn't help. I was wrong.

I can confirm the same problem occurring on 7950X and 7950X3D CPUs as well.

Do we know if Ryzen 9000 has this issue? I know I had this issue on Ryzen 5000, but to a lesser extent.

Update: disabling ASPM does not work, though it decreased the chance of a random reboot. The issue might be related to the memory addressing method used in Zen4.

(In reply to Ben Hirlston from comment #6)
> do we know if Ryzen 9000 has this issue?
> I know I had this issue on Ryzen 5000 but to a lesser extent

Could you elaborate on what was happening with 5000? (Reboots, MCE, something else?)

(In reply to h4ck3r from comment #8)
> (In reply to Ben Hirlston from comment #6)
> > do we know if Ryzen 9000 has this issue? I know I had this issue on Ryzen
> > 5000 but to a lesser extent
>
> Could you elaborate on what was happening with 5000?
> (reboots, mce, something other)

I would be using my Windows 11 virtual machine and doing something intensive that was using vls, and the machine would reset just like on the 7000, but it happened way less often.

(In reply to Ben Hirlston from comment #9)
> (In reply to h4ck3r from comment #8)
> > (In reply to Ben Hirlston from comment #6)
> > > do we know if Ryzen 9000 has this issue? I know I had this issue on Ryzen
> > > 5000 but to a lesser extent
> >
> > Could you elaborate on what was happening with 5000?
> > (reboots, mce, something other)
>
> I would be using my Virtual Machine to Windows Windows 11 and would be doing
> something intensive that was using vls and the machine would reset just like
> 7000 but it happened way less often

I no longer have that machine, so I can't test it anymore, but I remember it happening on a biweekly to monthly basis.

I'm afraid it might be an overlooked issue that propagated to Zen5. If there is someone with Zen5, let us know if nested virt also breaks on it. A perspective from kernel and QEMU devs on that topic is also appreciated.

I recently experienced this. I built a Proxmox cluster with 7950X. Every node that I tested on would hard reset with no logs when a VM was doing nested virtualization.

Our CI testing uses VMs, and putting the CI in a VM itself makes it pretty easy to reproduce, it just takes some time.

Setting kvm_amd.vls=0 seems to have resolved the issue: we had zero node resets today, and I was trying to force them.

Kernel is Proxmox's 6.8.12-1-pve.
Thanks,
Blake

(In reply to blake from comment #13)
> I recently experienced this. I built a proxmox cluster with 7950x. Every
> node that I tested on would hard reset with no logs when a VM was doing
> nested virtualization.
>
> Our CI testing uses VMs, and putting the CI in a VM itself makes it pretty
> easy to reproduce, just takes some time.
>
> Setting kvm_amd.vls=0 seems to have resolved the issue, we had zero node
> resets today and I was trying to force them.
>
> Kernel is Proxmox's 6.8.12-1-pve.
>
> Thanks,
> Blake

Disabling vls does help with crashes, but has too much of a performance penalty. In my case it would lock GPU utilization at 40% max (Proxmox, Win10/11, Hyper-V enabled).

For reference, the kernel I am running is 6.10.6-zen1-1-zen and I have a Ryzen 9 7900X3D.

Created attachment 306799 [details]
attachment-12508-0.html

At least for me, I haven't noticed any performance hit. These systems are all headless though.

On Sat, Aug 31, 2024, at 6:51 AM, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219009
>
> --- Comment #14 from h4ck3r (michal.litwinczuk@op.pl) ---
> (In reply to blake from comment #13)
> > I recently experienced this. I built a proxmox cluster with 7950x. Every
> > node that I tested on would hard reset with no logs when a VM was doing
> > nested virtualization.
> >
> > Our CI testing uses VMs, and putting the CI in a VM itself makes it pretty
> > easy to reproduce, just takes some time.
> >
> > Setting kvm_amd.vls=0 seems to have resolved the issue, we had zero node
> > resets today and I was trying to force them.
> >
> > Kernel is Proxmox's 6.8.12-1-pve.
> >
> > Thanks,
> > Blake
>
> Disabling vls does help with crashes, but has too much performance penalty.
> In my case it would lock gpu utilization at 40% max. (proxmox win10/11 hyperv
> enabled)
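
Since several comments above depend on whether kvm_amd.vls=0 (or avic=0) actually took effect, here is a minimal sketch for checking the current values on a running host. It assumes the kvm_amd module exposes its parameters under /sys/module/kvm_amd/parameters, as loadable modules normally do. The helper names `read_kvm_amd_params` and `vls_disabled` are made up for illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Assumption: kvm_amd publishes its module parameters in this sysfs directory.
KVM_AMD_PARAMS = Path("/sys/module/kvm_amd/parameters")


def read_kvm_amd_params(names=("vls", "nested", "avic"), base=KVM_AMD_PARAMS):
    """Return {name: value or None} for each requested kvm_amd parameter."""
    result = {}
    for name in names:
        try:
            result[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Module not loaded, or this kernel does not have the parameter.
            result[name] = None
    return result


def vls_disabled(params):
    """True only if the vls value was readable and reports an 'off' value."""
    # Integer parameters print 0/1 and boolean ones print N/Y, so accept both.
    return params.get("vls") in ("0", "N")


if __name__ == "__main__":
    params = read_kvm_amd_params()
    for name, value in params.items():
        shown = value if value is not None else "unavailable"
        print(f"kvm_amd.{name} = {shown}")
    print("VLS disabled:", vls_disabled(params))
```

To make the setting persistent, the usual routes are adding `kvm_amd.vls=0` to the kernel command line or putting `options kvm_amd vls=0` in a modprobe.d file. Which one applies depends on how the distribution loads the module.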
(In reply to blake from comment #16)
> Created attachment 306799 [details]
> attachment-12508-0.html
>
> At least for me, I haven't noticed any performance hit. These systems are
> all headless though.
>
> On Sat, Aug 31, 2024, at 6:51 AM, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=219009
> >
> > --- Comment #14 from h4ck3r (michal.litwinczuk@op.pl) ---
> > Disabling vls does help with crashes, but has too much performance penalty.
> > In my case it would lock gpu utilization at 40% max. (proxmox win10/11
> > hyperv enabled)

The performance hit for me hasn't been that bad, but disabling vls is the same as disabling svm for the VM. It's more like telling the VM the CPU is incapable of nested virtualization. That is what I observed.

I was wondering if this issue might be related to this thread on Gentoo's bug tracker?
https://bugs.gentoo.org/808990

My curiosity relates to this new bug being caused by the fixes for CVE-2021-3653 and CVE-2021-3656. I am wondering if the fixes for those CVEs are related to this bug.

Nope, these are not related, but do check if disabling AVIC helps (set the enable_avic parameter of kvm_amd to 0).

I mean avic=0.

On Sat, 2024-07-06 at 11:20 +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219009
>
> Bug ID: 219009
> Summary: Random host reboots on Ryzen 7000/8000 using nested VMs (vls suspected)
> Product: Virtualization
> Version: unspecified
> Hardware: AMD
> OS: Linux
> Status: NEW
> Severity: high
> Priority: P3
> Component: kvm
> Assignee: virtualization_kvm@kernel-bugs.osdl.org
> Reporter: zaltys@natrix.lt
> Regression: No
>
> Running nested VMs on AMD Ryzen 7000/8000 (ZEN4) CPUs results in random
> host reboots.
>
> There is no kernel panic, no log entries, no relevant output to serial
> console. It is as if the platform is simply hard reset. It seems time to
> reproduce it varies from system to system and can be dependent on workload
> and even specific CPU model.
>
> I can reproduce it with kernel 6.9.7 and qemu 9.0 on Ryzen 7950X3D in under
> one hour by using KVM -> Windows 10/11 with Hyper-V services on or KVM ->
> Windows 10/11 with 3 VBox VMs (also Win11) running. Other people had it
> repeatedly reproduced on Ryzen 7700, 7600 and 8700GE, including KVM -> KVM ->
> Linux. [1] I have also seen Hetzner (company offering Ryzen based dedicated
> servers) customers complaining about similar random reboots.
>
> I tried looking up errata for Ryzen 7000/8000, but could not find one
> published, so I decided to check errata for EPYC 9004 [2], which is also Zen4
> arch as Ryzen 7000/8000. It has nesting related bug #1495 (on page 49), which
> mentions using Virtualized VMLOAD/VMSAVE can result in MCE and/or system
> reset.
>
> Based on that errata mentioned above, I reconfigured my system with
> kvm_amd.vls=0 and for me random reboots with nested virtualization stopped.
> Same was reported by several people from [1].
>
> Somebody from AMD must be asked to confirm if it is really a Ryzen 7000/8000
> hardware bug, and if there is a better fix than disabling VLS as it has a
> performance hit. If disabling it is the only fix, then kvm_amd.vls=0 must be
> the default for Ryzen 7000/8000.
>
> [1] https://www.reddit.com/r/Proxmox/comments/1cym3pl/nested_virtualization_crashing_ryzen_7000_series/
> [2] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/revision-guides/57095-PUB_1_01.pdf

Hi!

Can someone from AMD take a look at this bug?

From the bug report it appears that recent Zen4 CPUs have an erratum in their virtual VMLOAD/VMSAVE implementation, which causes random host reboots (#MC?) when nesting is used, which is IMHO quite a serious issue.

Thanks,
Best regards,
Maxim Levitsky

Disabling AVIC does not help. I and some other people tried that a few months ago.

I've recently talked to a person who insisted they never had issues on their guests running with the host CPU type without disabling vls. From what I asked, it seems that all guests were Linux based and lacked PCI passthrough (Proxmox, newest kernel as of the time of this post). Further testing is required. I'm going to spin up a Linux guest with approximately half of host memory (no ballooning) and without any external device attached to see if stability can be achieved that way.

But the question is: did they use nested virtualization on Linux actively and with vls enabled?

The use case which causes the reboots, as I understand it, is Hyper-V enabled Windows, in which case pretty much the whole of Windows is running as a nested VM, nested under the Hyper-V hypervisor.

Once I get my hands on a client Zen4 machine (I only have Zen2 at home), I will also try to reproduce this, but no promises on when this will happen.
Meanwhile I really hope that someone from AMD can take a look at this, and either confirm that this is or will be fixed with a microcode patch or confirm that we have to disable vls on the affected CPUs.

Best regards,
Maxim Levitsky

(In reply to blake from comment #13)
> I recently experienced this. I built a proxmox cluster with 7950x. Every
> node that I tested on would hard reset with no logs when a VM was doing
> nested virtualization.
>
> Our CI testing uses VMs, and putting the CI in a VM itself makes it pretty
> easy to reproduce, just takes some time.

Blake, what OS is used in your VMs?

Created attachment 306983 [details]
attachment-622-0.html

All of ours are running Debian Bookworm, stock kernel linux-image-6.1.0-26-amd64. The hosts are on 6.8.12-2-pve.

On Tue, Oct 8, 2024, at 12:53 PM, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219009
>
> --- Comment #27 from Žilvinas Žaltiena (zaltys@natrix.lt) ---
> (In reply to blake from comment #13)
> > I recently experienced this. I built a proxmox cluster with 7950x. Every
> > node that I tested on would hard reset with no logs when a VM was doing
> > nested virtualization.
> >
> > Our CI testing uses VMs, and putting the CI in a VM itself makes it pretty
> > easy to reproduce, just takes some time.
>
> Blake, what OS is used in your VMs ?

(In reply to mlevitsk from comment #26)
> But the question is - did they use nested virtualization on Linux actively
> and with vls enabled?
>
> The use case which causes the reboots as I understand is Hyperv enabled
> Windows, in which case pretty much the whole Windows is running as a nested
> VM, nested to the Hyperv hypervisor.
>
> Once I get my hands on a client Zen4 machine (I only have Zen2 at home), I
> will also try to reproduce this but not promises when this will happen.
>
> Meanwhile I really hope that someone from AMD can take a look a this, and
> either confirm that this is or will be fixed with a microcode patch or
> confirm that we have to disable vls on the affected CPUs.
>
> Best regards,
> Maxim Levitsky

Not really, most of them are microservice-type ones. That would also mean there is less chance of corruption, since they occupied less host memory. And Windows somehow uses nested virt even if Hyper-V is not installed (the installation does not, but a freshly booted guest crashed my node).

I'm afraid it might be an unresolvable issue, even with microcode. At least most things point to a similar issue as the memory leaks on their iGPUs.

I've also seen some memory-related issues after overloading guest RAM. For example, after having to write to swap on a Windows guest, I've seen PCIe stutters which persisted even after freeing memory (v4 CPU type). The same thing happened when a guest ran for too long. Though that might be more of a VFIO issue than anything else.

In my case I am on Arch Linux and I am running 6.11.2-zen1-1-zen, the Zen kernel. I need to disable vls for my VM to remain stable; if I don't, my system will randomly reboot after a random amount of time. Disabling vls helps stop that, but it makes nested virtualization stop working, so I can't run WSL 2 or Hyper-V correctly in my Windows 11 guest.

This is very interesting. Disabling VLS should not have any effect other than performance loss: KVM emulates VMLOAD/VMSAVE in this case, instead of the hardware doing so.

@Ben Hirlston, can you double check that with VLS WSL2 works, and without VLS it doesn't?

I wanna say it doesn't. If I ever got it to launch, it was probably defaulting to WSL1 as a fallback.

(In reply to h4ck3r from comment #29)
> (In reply to mlevitsk from comment #26)
> > But the question is - did they use nested virtualization on Linux actively
> > and with vls enabled?
> >
> > The use case which causes the reboots as I understand is Hyperv enabled
> > Windows, in which case pretty much the whole Windows is running as a nested
> > VM, nested to the Hyperv hypervisor.
> >
> > Once I get my hands on a client Zen4 machine (I only have Zen2 at home), I
> > will also try to reproduce this but not promises when this will happen.
> >
> > Meanwhile I really hope that someone from AMD can take a look a this, and
> > either confirm that this is or will be fixed with a microcode patch or
> > confirm that we have to disable vls on the affected CPUs.
> >
> > Best regards,
> > Maxim Levitsky
>
> Not really - most of them are microservice type ones.
> That would also mean there is less chance of corruption since they occupied
> less host memory.
> And windows uses nested virt even if hyperv is not installed somehow.
> (installation does not, but freshly booted guest crashed my node)
>
> Im afraid it might be unresolvable issue, even with microcode.
> At least most things point to similar issue as memory leaks on their igpus.

I have a 7900X on which I use the iGPU, with an RX 7800 XT as dGPU (used for AI work) and an ASUS ProArt X670E-CREATOR motherboard. That being said, I actively use virtualization. When disabling VLS, and even when disabling nested virtualization completely, my host kept rebooting unpredictably. That is, no advice from this thread helped me. As a result, I disabled the iGPU in the BIOS and enabled nested virtualization and VLS. With this configuration the reboots continued. But after disabling VLS the reboots disappeared, and now the PC works without fail. I can assume that the problem with VLS and the use of the iGPU may be somehow related. It would be very interesting to know if this problem is present on Ryzen 9000.

Yeah, I'm not using the iGPU in my 7900X3D at all. I have a 7800 XT and an RTX 3060 that is set up to run with the vfio-pci driver on my MSI MPG X670E CARBON WIFI motherboard, and disabling VLS made my random reboots stop.
So to hear that having an iGPU enabled caused it to continue is interesting.

WSL and Hyper-V should still work with vls disabled on the host CPU type. It will run much slower though. Since there is a memory leak on AM5 iGPUs, they might contribute to reboots. That said, I have the iGPU enabled in the BIOS (not connected nor used) and never saw a reboot with vls disabled.

Wow, this bug sent me down a weird path :P

I've been using a Windows VM for the past 6 months, daily, without problems. Then suddenly 2 days ago (I didn't update the kernel or anything else), my PC started randomly rebooting when under load. Reading about problems with AMD CPUs (and specifically the 7950X), I bought a PSU and swapped it. Same problem. Then the CPU (for a 7600), then the motherboard, then the memory... same. I could 100% reproduce the problem within 3 minutes by putting the VM under load. AFAIK, there was no MCE logged.

I finally stumbled on this thread, set kvm_amd.vls=0, and that indeed fully fixed the problem.

I don't know how or why Hyper-V was suddenly enabled in my Windows VM, but that's obviously what happened. When it did, it triggered this bug quite violently.

Thanks everyone for your feedback and testing. The following change will go into 6.12 and back to the stable kernels to fix this issue. It essentially has the same effect that kvm_amd.vls=0 did.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=a5ca1dc46a6b610dd4627d8b633d6c84f9724ef0

I've been encountering random reboots with a 7950X as well, oddly with no nested virtualization. I just came across this via a Phoronix article and tried the kvm_amd.vls=0 argument.

I've hit my third reboot today after trying to amend the flag. I'm going to try bringing in the patch today.

I have two virtual machines I use. Neither is using nested virtualization:
- Windows 11 with VFIO and PCIe passthrough
- Linux guest with SPICE and OpenGL via virgl

Even with just the Linux guest it crashes.

My main distribution is NixOS with 6.6.
But I've tried Fedora, Arch and Ubuntu with 6.6, 6.8, 6.10 and 6.11.

I've disabled power saving and tried a number of different tweaks. I've also swapped memory, motherboard, disks and power supply. Updated the BIOS, etc.

The system started showing instability after I bumped to 6.X and updated to the BIOS released after the voltage issues with the Zen 4 CPUs. I was on 5.1X for a while due to bugs in GPU initialization while using VFIO.

There is a note on the iGPU, which I'm using and experiencing a number of issues with. I won't go into that, and expect it's due to 6.6, which I'm locked to right now due to ZFS.

It sounds like you have a separate stability issue; it should be brought into its own bug.

I am testing Linux Zen 6.12 with the vls flag disabled, and so far I haven't encountered a crash with WSL2 and Windows 7 in VirtualBox in my Windows KVM/QEMU VM, so I guess we will see if that changes with time.

To test this I fired up my KVM/QEMU Windows 11 VM, installed VirtualBox in it, and installed WSL2 to see if running virtual machines in KVM/QEMU would trigger the crash. I've been testing for 20 minutes and have yet to see a crash.

The way I obtained Linux Zen 6.12 was by going to the extra-testing repo on archlinux.org and running sudo pacman -U linux-zen-version linux-zen-headers-version to upgrade to those packages.
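
For anyone wanting to check whether a given kernel still advertises virtualized VMLOAD/VMSAVE on their CPU, the following is a rough sketch that inspects the CPU flags. It rests on two assumptions not confirmed in this thread: that the fix works by hiding the capability from the kernel's CPU feature set (the comment above only says it has the same effect as kvm_amd.vls=0), and that the capability is listed in /proc/cpuinfo as `v_vmsave_vmload`. The function names are illustrative only.

```python
def cpu_flags(cpuinfo_text):
    """Collect flag names from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def advertises_vls(cpuinfo_text, flag="v_vmsave_vmload"):
    """True if the (assumed) virtualized VMLOAD/VMSAVE flag is present."""
    # Assumption: the feature appears under this flag name in /proc/cpuinfo.
    return flag in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        print("could not read /proc/cpuinfo")
    else:
        print("virtualized VMLOAD/VMSAVE advertised:", advertises_vls(text))
```

If the flag is absent on an affected CPU, that would be consistent with the fix being active, but it is not proof: the flag can also be missing for other reasons, such as the firmware not exposing the feature.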