Bug 197813 - KVM: entry failed, hardware error 0x5
Summary: KVM: entry failed, hardware error 0x5
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: Intel Linux
Importance: P1 blocking
Assignee: virtualization_kvm
Reported: 2017-11-08 09:28 UTC by biaoxiangye
Modified: 2018-03-20 17:51 UTC

Kernel Version: 3.13.6
Regression: No

Attachments
qemu log info (3.25 KB, application/gzip)
2017-11-08 09:28 UTC, biaoxiangye

Description biaoxiangye 2017-11-08 09:28:32 UTC
Created attachment 260561
qemu log info

[host CPU]:
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
[qemu cmdline]:
/usr/bin/qemu-system-x86_64 -name CZJ-TEST_clone2 -S -machine pc-i440fx-2.1,accel=kvm,usb=off,system=linux -m 4096,slots=10,maxmem=33554432M  -smp 16,maxcpus=160,sockets=10,cores=16,threads=1 -numa node,nodeid=0,cpus=0-15,mem=4096 -uuid 8ef69d74-8e6a-4c5a-9570-a2d7f6f8bf66 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/run/lib/libvirt/qemu/CZJ-TEST_clone2.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-ehci,id=ehci,bus=pci.0,addr=0x4 -device nec-usb-xhci,id=xhci,bus=pci.0,addr=0x5 -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -drive file=/vms/images/CZJ-TEST_clone2,if=none,id=drive-virtio-disk0,format=qcow2,cache=directsync,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/vms/isos/iso9600-a11e6280-96ca-4725-ac07-23597e2083e8.iso,if=none,id=drive-ide0-0-0,readonly=on,format=raw,cache=directsync,aio=native -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=2 -netdev tap,fd=49,id=hostnet0,vhost=on,vhostfd=50 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=0c:da:41:1d:b9:10,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/CZJ-TEST_clone2.agent,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0 -vnc 0.0.0.0:4 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9
[qemu version]
qemu 2.1.2
[qemu error log]
KVM: entry failed, hardware error 0x5
EAX=ffffffed EBX=37fe0000 ECX=00000000 EDX=00000000
ESI=00000000 EDI=00000046 EBP=37fe3e98 ESP=00000080
EIP=81058e96 EFL=001db68a [D-S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0080 003eb338 00000000 00008000 DPL=0 <hiword>
CS =44c1 001de741 00000000 00009e00 DPL=0 CS16 [CR-]
SS =097a 00400000 fffff680 00008000 DPL=0 <hiword>
DS =0080 0020bf62 00000000 00008000 DPL=0 <hiword>
FS =0080 001dc5a4 00000000 00008000 DPL=0 <hiword>
GS =cdc3 001dca83 00000000 00009d00 DPL=0 CS16 [C-A]
LDT=de42 001cedc2 00000000 00009c00 DPL=0 CS16 [C--]
TR =0f42 00400000 fffff680 00008000 DPL=0 <hiword>
GDT=     0000000000400000 0000f680
IDT=     00000000001df68c 00000000
CR0=8005003b CR2=00007f43f702b2c8 CR3=00000000001dd2cb CR4=0000046a
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??

It has occurred several times, and we don't know how to reproduce it.
I looked up error number 5 in the table of VM-instruction error numbers; it means "VMRESUME with non-launched VMCS".
According to the manual, VMLAUNCH is used for the first VM entry after VMCLEAR, and VMRESUME should be used for any subsequent VM entry using that VMCS (until the next execution of VMCLEAR for the VMCS).
Is this a KVM bug or a CPU issue?
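
The launch-state rule quoted above can be made concrete with a toy model. This is only a sketch of the rule as the Intel manual describes it, not KVM's code: the class `ToyVmcs` and its method names are hypothetical, the model starts in the "clear" state for simplicity, and only VM-instruction error numbers 4 and 5 are modelled.

```python
from enum import Enum


class LaunchState(Enum):
    CLEAR = "clear"
    LAUNCHED = "launched"


# Two entries from the VM-instruction error number table (assumed subset).
VMLAUNCH_WITH_NON_CLEAR_VMCS = 4
VMRESUME_WITH_NON_LAUNCHED_VMCS = 5


class ToyVmcs:
    """Hypothetical model of a VMCS launch state; not real KVM or hardware code."""

    def __init__(self):
        # Assumption: treat a fresh VMCS as if VMCLEAR had already been run.
        self.state = LaunchState.CLEAR

    def vmclear(self):
        self.state = LaunchState.CLEAR

    def vmlaunch(self):
        """Return 0 on success, or the VM-instruction error number."""
        if self.state is not LaunchState.CLEAR:
            return VMLAUNCH_WITH_NON_CLEAR_VMCS
        self.state = LaunchState.LAUNCHED
        return 0

    def vmresume(self):
        """Return 0 on success, or the VM-instruction error number."""
        if self.state is not LaunchState.LAUNCHED:
            return VMRESUME_WITH_NON_LAUNCHED_VMCS
        return 0


vmcs = ToyVmcs()
print(vmcs.vmresume())  # 5: resume attempted before any launch
print(vmcs.vmlaunch())  # 0: first entry after VMCLEAR
print(vmcs.vmresume())  # 0: subsequent entries use VMRESUME
```

In this model, error 5 can only appear if the launch state tracked by software and the state held by the hardware disagree, which is why the question of "KVM bug or CPU issue" is a natural one.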
Comment 1 Carlos Lopez 2018-02-13 17:26:29 UTC
We are experiencing the same problem with an equivalent processor of the same generation.

[host CPU]:
Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

[kvmqemulog]

/usr/bin/kvm -id 139 -chardev socket,id=qmp,path=/var/run/qemu-server/139.qmp,server,nowait -mon chardev=qmp,mode=control -pidfile /var/run/qemu-server/139.pid -smbios type=1,uuid=04fc63bd-caef-424d-a047-c99d0ffd5be0 -name DEVsai-store01.es.amnesty.org -smp 10,sockets=1,cores=10,maxcpus=10 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vga std -vnc unix:/var/run/qemu-server/139.vnc,x509,password -cpu Broadwell-noTSX,+kvm_pv_unhalt,+kvm_pv_eoi,enforce,vendor=GenuineIntel -m 12288 -k es -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e -device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=tablet,bus=uhci.0,port=1 -chardev socket,id=serial0,path=/var/run/qemu-server/139.serial0,server,nowait -device isa-serial,chardev=serial0 -iscsi initiator-name=iqn.1993-08.org.debian:01:481b3cd8ee13 -drive if=none,id=drive-ide2,media=cdrom,aio=threads -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200 -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 -drive file=/dev/zvol/zdata1/vm-139-disk-4,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0 -drive file=/dev/zvol/zdata1/vm-139-disk-5,if=none,id=drive-scsi1,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1 -drive file=/dev/zvol/zdata1/vm-139-disk-6,if=none,id=drive-scsi3,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=3,drive=drive-scsi3,id=scsi3 -drive file=/dev/zvol/zdata1/vm-139-disk-7,if=none,id=drive-scsi4,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=4,drive=drive-scsi4,id=scsi4 -device ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7 -drive 
file=/dev/zvol/zdata1/vm-139-disk-1,if=none,id=drive-sata0,format=raw,cache=none,aio=native,detect-zeroes=on -device ide-drive,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=104 -drive file=/dev/zvol/zdata1/vm-139-disk-2,if=none,id=drive-sata3,format=raw,cache=none,aio=native,detect-zeroes=on -device ide-drive,bus=ahci0.3,drive=drive-sata3,id=sata3 -drive file=/dev/zvol/zdata1/vm-139-disk-3,if=none,id=drive-sata5,format=raw,cache=none,aio=native,detect-zeroes=on -device ide-drive,bus=ahci0.5,drive=drive-sata5,id=sata5 -netdev type=tap,id=net0,ifname=tap139i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown -device e1000,mac=62:20:40:70:2E:55,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300

KVM: entry failed, hardware error 0x5
RAX=0000000000000000 RBX=ffffffff81f3c9c0 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=ffff8803313dbe90 RSP=ffff8803313dbe90
R8 =ffff880333490580 R9 =0000000000000000 R10=00000000ffff5b28 R11=000000000001e400
R12=0000000000000009 R13=0000000000000000 R14=0000000000000000 R15=ffff8803313d8000
RIP=ffffffff81064646 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 0000000000000000 ffffffff 00000000
GS =0000 ffff880333480000 ffffffff 00000000
LDT=0000 0000000000000000 ffffffff 00000000
TR =0040 ffff8803334848c0 00002087 00008b00 DPL=0 TSS64-busy
GDT=     ffff88033348c000 0000007f
IDT=     ffffffffff574000 00000fff
CR0=80050033 CR2=00007f567ecf4700 CR3=000000032ef4e000 CR4=00360670
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84 00 00 00 00 00 55 49 89 c9 

Since the error and the platform are almost identical, could it be Broadwell/Xeon related?
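
For anyone comparing dumps, the `Code=` line can be inspected with a few lines of Python. This is a rough sketch, not a disassembler: `parse_code_line` and `KNOWN_OPCODES` are hypothetical names, the table only covers a handful of single-byte opcodes, and it assumes the byte in angle brackets is the one at RIP.

```python
# A few single-byte x86-64 opcodes; a lookup table, not a real disassembler.
KNOWN_OPCODES = {
    0x55: "push rbp",
    0x5D: "pop rbp",
    0xC3: "ret",
    0xF4: "hlt",
    0xFB: "sti",
}


def parse_code_line(line):
    """Split a 'Code=' line into (bytes_before, marked_byte, bytes_after).

    Unreadable bytes, printed as '??', are returned as None.
    """
    if not line.startswith("Code="):
        raise ValueError("not a Code= line")
    tokens = line[len("Code="):].split()
    marked = [i for i, t in enumerate(tokens)
              if t.startswith("<") and t.endswith(">")]
    if len(marked) != 1:
        raise ValueError("expected exactly one <..> marked byte")

    def to_byte(token):
        text = token.strip("<>")
        return None if text == "??" else int(text, 16)

    values = [to_byte(t) for t in tokens]
    idx = marked[0]
    return values[:idx], values[idx], values[idx + 1:]


dump = ("Code=89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 "
        "<5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84 "
        "00 00 00 00 00 55 49 89 c9")
before, at_rip, after = parse_code_line(dump)
for byte in before[-2:] + [at_rip]:
    print(f"{byte:02x}", KNOWN_OPCODES.get(byte, "?"))
```

In the dump above, the two bytes before the marked one are `fb f4` (`sti`, `hlt`), which would be consistent with the vCPU sitting in an idle halt sequence when the entry failed. That reading is an interpretation of the bytes, not something the log states.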
Comment 2 Carlos Lopez 2018-02-13 17:32:51 UTC
Kernel info in our case

4.13.13-5-pve #1 SMP PVE 4.13.13-38 x86_64
Comment 3 Bandan Das 2018-02-13 18:13:18 UTC
Did you happen to try the latest upstream kernel as well?
Comment 4 Carlos Lopez 2018-02-14 15:15:44 UTC
At the moment we are using Proxmox, which builds its distro around the Ubuntu kernel, so we could wait until the release of Ubuntu 18.04, when they build the next version, to try a new kernel.
However, since we don't have the skills to debug the panic more thoroughly at the moment, we are thinking of downgrading the platform to an older one to see if the problem persists.
We have been experiencing worse and worse behaviour with every update of the kernel/distro.

Since we have another machine with an older processor from the same family that is not experiencing any issues, I attach the cpuinfo of the problematic one and of the good old one in case it is relevant.

[KVM WITH panics/frozen VM]
Kernel 

4.13.13-5-pve #1 SMP PVE 4.13.13-38 x86_64

processor	: 40
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
stepping	: 1
microcode	: 0xb000021
cpu MHz		: 2199.991
cache size	: 30720 KB
physical id	: 0
siblings	: 24
core id		: 10
cpu cores	: 12
apicid		: 21
initial apicid	: 21
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2
bogomips	: 4399.98
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:


[KVM WITHOUT PROBLEMS] 
Kernel 4.4.95-1-pve #1 SMP PVE 4.4.95-99

processor	: 39
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
stepping	: 2
microcode	: 0x31
cpu MHz		: 2743.203
cache size	: 25600 KB
physical id	: 1
siblings	: 20
core id		: 12
cpu cores	: 10
apicid		: 57
initial apicid	: 57
fpu		: yes
fpu_exception	: yes
cpuid level	: 15
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs		:
bogomips	: 5201.14
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
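
Comparing the two `flags` lines by eye is error-prone, so a small script can help. This is a minimal sketch: `cpuinfo_flags` and `diff_flags` are hypothetical helper names, and the sample strings are abbreviated excerpts of the two listings above, not the full flag sets.

```python
def cpuinfo_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def diff_flags(bad_text, good_text):
    """Return (flags only on the bad host, flags only on the good host)."""
    bad = cpuinfo_flags(bad_text)
    good = cpuinfo_flags(good_text)
    return sorted(bad - good), sorted(good - bad)


# Abbreviated excerpts of the two hosts' flag lines (illustration only).
bad_host = "model\t\t: 79\nflags\t\t: fpu vmx ept vpid hle rtm\n"
good_host = "model\t\t: 63\nflags\t\t: fpu vmx ept vpid eagerfpu\n"

only_bad, only_good = diff_flags(bad_host, good_host)
print("only on problematic host:", only_bad)
print("only on working host:", only_good)
```

Note that the two hosts run different kernels (4.13 and 4.4), so some differences in the full output may reflect the kernel version rather than the hardware.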
Comment 5 Bandan Das 2018-02-15 18:12:08 UTC
bugzilla-daemon@bugzilla.kernel.org writes:

> https://bugzilla.kernel.org/show_bug.cgi?id=197813
>
> --- Comment #4 from Carlos Lopez (clopezbelenguer@es.amnesty.org) ---
> At the moment since we are using PRoxmox and they build the distro around the
> ubuntu kernel we could wait until the release of ubuntu 18.04 when they build
> the next versión to try a new kernel. 
> but
> Considering we haven't the skills to debug more thoroughly  the panic, atm,
> we
> are thinking in Downgrade the platform to an older one and see if the problem
> persist. 
> We have been experiencing worse and worse behaviour at every update of the
> kernel/distro

This does sound good. If you can run an older version of the kernel and confirm
if the issue goes away, that would be a good place to start. Please update the bug
with the older kernel version when you get a chance to run this test.

> Since we have another machine with an older processor from the same family
> and
> we are not experiencing any issues i attach cpuinfo of the problematic one
> and
> the goodold one in case is relevant.

What's the qemu cmdline for the Haswell host? Can you try -cpu Haswell on the Broadwell
host as well and see if it makes any difference?

Comment 6 Carlos Lopez 2018-02-22 10:52:31 UTC
(In reply to Bandan Das from comment #5)
> This does sound good. If you can run an older version of the kernel and
> confirm
> if the issue goes away, that would be a good place to start. Please update
> the bug
> with the older kernel version when you get a chance to run this test.

We tried the newer kernel 4.15.3-041503-generic with no success.

The machine crashed with this output in less than 24 hours. Any clues about what to do next?

[qemu cmdline fragments that were interleaved with the register dump in the log]
...0ffd5be0 -name DEVsai-store01.es.amnesty.org -smp 10,sockets=1,cores=10,maxcpus=10 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/sha...
...password -cpu Haswell,+kvm_pv_unhalt,+kvm_pv_eoi,enforce,vendor=GenuineIntel -m 12288 -k es -device pci-brid...
...=2,bus=pci.0,addr=0x1f -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=tablet,bu...
...rver,nowait -device isa-serial,chardev=serial0 -iscsi initiator-name=iqn.1993-08.org.debian:01:481b3cd8ee13 ...
...0,drive=drive-ide2,id=ide2,bootindex=200 -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 -drive file=/...
...=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0 -drive file...
...scsi1,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1 -drive fi...
...w,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=3,drive=drive-scsi3,id=scsi3 -drive ...
...raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=4,drive=drive-scsi4,id=scsi4 -devi...
...,addr=0x7 -drive file=/dev/data2/vm-139-disk-1,if=none,id=drive-sata0,format=raw,cache=none,aio=native,detect-zeroes=on -device ide-drive,bus=ahci0....
...-drive file=/dev/data2/vm-139-disk-2,if=none,id=drive-sata3,format=raw,cache=none,aio=native,detect-zeroes=on -device ide-drive,bus=ahci0.3,drive=dr...
...m-139-disk-3,if=none,id=drive-sata5,format=raw,cache=none,aio=native,detect-zeroes=on -device ide-drive,bus=ahci0.5,drive=drive-sata5,id=sata5 -netd...
.../var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown -device e1000,mac=62:20:40:70:2E:55,netdev=net0,bus=pci.0,addr=0x12,i...

KVM: entry failed, hardware error 0x5
RAX=0000000000000000 RBX=ffffffff81f3c9c0 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=ffff8803313cbe90 RSP=ffff8803313cbe90
R8 =ffff880333390580 R9 =0000000000000000 R10=00000001005d277a R11=0000000000003c00
R12=0000000000000007 R13=0000000000000000 R14=0000000000000000 R15=ffff8803313c8000
RIP=ffffffff81064646 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 0000000000000000 ffffffff 00000000
GS =0000 ffff880333380000 ffffffff 00000000
LDT=0000 0000000000000000 ffffffff 00000000
TR =0040 ffff8803333848c0 00002087 00008b00 DPL=0 TSS64-busy
GDT=     ffff88033338c000 0000007f
IDT=     ffffffffff574000 00000fff
CR0=80050033 CR2=00007f4b4f09ec80 CR3=00000000b9cf4000 CR4=00160670
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84 00 00 00 00 00 55 49 89 
> 
> > Since we have another machine with an older processor from the same family
> > and
> > we are not experiencing any issues i attach cpuinfo of the problematic one
> > and
> > the goodold one in case is relevant.
> 
> What's the qemu cmdline for the Haswell host ? Can you try -cpu Haswell on
> the Broadwell
> host as well and see if it makes any difference ? 

This is the command line on the Haswell host:

/usr/bin/kvm -id 154 -chardev socket,id=qmp,path=/var/run/qemu-server/154.qmp,server,nowait -mon chardev=qmp,mode=control -pidfile /var/run/qemu-server/154.pid -daemonize -smbios type=1,uuid=04fc63bd-caef-424d-a047-c99d0ffd5be0 -name sai-store01.es.amnesty.org -smp 6,sockets=1,cores=6,maxcpus=6 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vga cirrus -vnc unix:/var/run/qemu-server/154.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 8000 -k es -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e -device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=tablet,bus=uhci.0,port=1 -iscsi initiator-name=iqn.1993-08.org.debian:01:9f87b7a31af -drive if=none,id=drive-ide2,media=cdrom,aio=threads -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200 -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 -drive file=/dev/pve/vm-154-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0 -drive file=/srv/data/images/154/vm-154-disk-7.raw,if=none,id=drive-scsi1,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi1,id=scsi1 -drive file=/srv/data/images/154/vm-154-disk-1.raw,if=none,id=drive-scsi2,cache=writethrough,format=raw,aio=threads,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi2,id=scsi2 -drive file=/srv/data/images/154/vm-154-disk-8.raw,if=none,id=drive-scsi3,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=3,drive=drive-scsi3,id=scsi3 -drive file=/srv/data/images/154/vm-154-disk-9.raw,if=none,id=drive-scsi4,format=raw,cache=none,aio=native,detect-zeroes=on -device 
scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=4,drive=drive-scsi4,id=scsi4 -device ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7 -drive file=/dev/pve/vm-154-disk-2,if=none,id=drive-sata0,format=raw,cache=none,aio=native,detect-zeroes=on -device ide-drive,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=105 -drive file=/srv/data/images/154/vm-154-disk-4.raw,if=none,id=drive-sata3,format=raw,cache=none,aio=native,detect-zeroes=on -device ide-drive,bus=ahci0.3,drive=drive-sata3,id=sata3 -netdev type=tap,id=net0,ifname=tap154i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown -device e1000,mac=[xxxx],netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300

We also tried the Haswell CPU model on the Broadwell host, without success.

One thing left to do is to try an older kernel.

Comment 7 Carlos Lopez 2018-02-22 15:36:34 UTC
We tried an older kernel with no success: same error, same frozen state.

CPU(s): 48 x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (2 Sockets)
Kernel Version: Linux 4.10.17-3-pve
(this is the kernel that works on the other server without problems)

Any advice about what to try next, or how to debug the problem more thoroughly?



KVM: entry failed, hardware error 0x5
RAX=0000000000000000 RBX=ffffffff81f3c9c0 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=ffff8803313abe90 RSP=ffff8803313abe90
R8 =ffff880333090580 R9 =0000000000000000 R10=00000001000eff39 R11=0000000000006800
R12=0000000000000001 R13=0000000000000000 R14=0000000000000000 R15=ffff8803313a8000
RIP=ffffffff81064646 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 0000000000000000 ffffffff 00000000
GS =0000 ffff880333080000 ffffffff 00000000
LDT=0000 0000000000000000 ffffffff 00000000
TR =0040 ffff8803330848c0 00002087 00008b00 DPL=0 TSS64-busy
GDT=     ffff88033308c000 0000007f
IDT=     ffffffffff574000 00000fff
CR0=80050033 CR2=00007f991acac700 CR3=000000032e452000 CR4=00160670
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84 00 00 00 00 00 55 49 89 c9
Comment 8 biaoxiangye 2018-03-01 09:48:53 UTC
Hi all,
We have several servers with the same CPU type, but only one of them shows this problem, so we guessed it was a CPU defect. After we replaced the CPU, the problem no longer occurred.
It looks like a CPU defect, but this is not 100% confirmed.
Comment 9 Carlos Lopez 2018-03-12 10:22:49 UTC
(In reply to biaoxiangye from comment #8)
> hi, all
>  We have several servers with the same cpu type, but just one of them occur
> this problem. So we  guess it's a cpu defect, and after replace the cpu ,
> the problem does not occur anymore. 
>    It seems like a cpu defect, but not 100% confirmation.

Hi,

We tried the latest microcode available for our processor, and it has been stable so far (for 10 days).
processor	: 47
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
stepping	: 1
microcode	: 0xb00002a

Can you confirm which microcode version the faulty CPU/machine has?
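
To make the microcode comparison easy to repeat on other hosts, the revision can be pulled out of `/proc/cpuinfo` text with a short script. This is a sketch under the assumption that the `processor` and `microcode` fields look like the listings in this thread; `microcode_revisions` is a hypothetical helper name.

```python
def microcode_revisions(cpuinfo_text):
    """Map each logical processor number to its microcode revision (as an int)."""
    revisions = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "microcode" and current is not None:
            # The field is printed in hex with a 0x prefix, e.g. 0xb00002a.
            revisions[current] = int(value, 16)
    return revisions


# Excerpt of the listing above; on a live host the text would come from
# open("/proc/cpuinfo").read() instead.
sample = "processor\t: 47\nmodel\t\t: 79\nmicrocode\t: 0xb00002a\n"
for cpu, rev in microcode_revisions(sample).items():
    print(f"cpu {cpu}: microcode {rev:#x}")
```

Checking that every logical processor reports the same revision, and that it is newer than the 0xb000021 seen earlier, is a quick way to confirm the update actually took effect.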
Comment 10 Carlos Lopez 2018-03-20 17:51:00 UTC
It happened again after upgrading the kernel: same error/panic.
The weird thing is that we reverted to the last working kernel, Linux 4.10.17-3-pve, and it happened again.
We have no clue how to trigger the panic,
nor why it was stable for 10 days.