In the wake of the Intel TSX bug and the release of the microcode-update, one issue cropped up post-resume. The microcode-update is not re-applied to cpu0. On SMP systems this results in cpu0 having TSX instructions enabled and the remaining cpus having them disabled by the microcode-update.

# dmesg | grep microcode
[    0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[    0.093040] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[    0.113650] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[    0.134293] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[    0.345610] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[    0.345615] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[    0.345620] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[    0.345623] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[    0.345654] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
[   79.131245] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[   79.145367] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[   79.159411] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03

dmesg confirms that cpu0 doesn't receive the update on resume, but the kernel is unaware of this and still reports the same microcode version for all cpus (or cores):

# cat /sys/devices/system/cpu/cpu*/microcode/version
0x1c
0x1c
0x1c
0x1c

Furthermore, /proc/cpuinfo reports that the TSX instructions HLE and RTM are still disabled.
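For anyone who wants to check their own log without reading it by eye, here is a rough sketch that lists which cpus logged an early microcode update after a given timestamp. It assumes the "CPUn microcode updated early" message format shown in the dmesg output above; the function name and the timestamp argument are my own, not anything the kernel provides.

```shell
#!/bin/sh
# Sketch only: reads dmesg-style lines on stdin and prints the CPUs that
# logged "microcode updated early" after the given timestamp.
# updated_after and its argument are hypothetical names for illustration.
updated_after() {
    # $1: timestamp in seconds, e.g. a value just before the resume
    awk -v t="$1" '
        /microcode updated early/ {
            line = $0
            sub(/^\[ */, "", line)   # drop the leading "[" and padding
            ts = line + 0            # leading numeric prefix is the timestamp
            if (ts > t && match($0, /CPU[0-9]+/))
                print substr($0, RSTART, RLENGTH)
        }
    '
}

# Example: dmesg | updated_after 1.0
```

If the boot cpu is missing from the list after a resume, that would match the behavior described in this report.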
Probing the capabilities directly pre-suspend:

# cpuid | grep HLE
      HLE hardware lock elision                = false
      HLE hardware lock elision                = false
      HLE hardware lock elision                = false
      HLE hardware lock elision                = false

And post-resume:

# cpuid | grep HLE
      HLE hardware lock elision                = true
      HLE hardware lock elision                = false
      HLE hardware lock elision                = false
      HLE hardware lock elision                = false

Since glibc doesn't rely on the kernel, but probes the instructions directly, the following can happen:

a) An application is started and the glibc code for instruction-detection is run on cpu0
b) TSX is detected and the codepaths using HLE/RTM go live
c) The application's process is moved to another cpu (with TSX disabled)
d) The process receives SIGILL in libpthread and is terminated

Bringing cpu0 off- and back online, effectively re-applying the update, gets the system back into a usable state:

# echo 0 > /sys/devices/system/cpu/cpu0/online
# echo 1 > /sys/devices/system/cpu/cpu0/online
# cpuid | grep HLE
      HLE hardware lock elision                = false
      HLE hardware lock elision                = false
      HLE hardware lock elision                = false
      HLE hardware lock elision                = false

Since cpu0 is special-cased on suspend and doesn't follow the regular hotplug routine, it seems to not receive the microcode update on resume.

Downstream reports:
https://bugs.archlinux.org/task/42689
https://bbs.archlinux.org/viewtopic.php?pid=1472766
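The per-cpu comparison above can be scripted. The following is a minimal sketch, assuming cpuid prints one "HLE ... = true/false" line per logical cpu in cpu order, as in the output quoted in this report; check_hle_consistent is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: reads `cpuid | grep HLE`-style lines on stdin, one per
# logical cpu, prints "cpuN=<value>" for each, and returns 1 if the
# values are not identical across all cpus.
# check_hle_consistent is a hypothetical name for illustration.
check_hle_consistent() {
    awk '
        BEGIN { n = 0; mismatch = 0 }
        /HLE/ {
            val = $NF                      # last field: "true" or "false"
            printf "cpu%d=%s\n", n, val
            if (n == 0) first = val
            else if (val != first) mismatch = 1
            n++
        }
        END { exit mismatch }
    '
}

# Example (needs the cpuid tool; the sysfs writes need root):
#   cpuid | check_hle_consistent || {
#       echo 0 > /sys/devices/system/cpu/cpu0/online
#       echo 1 > /sys/devices/system/cpu/cpu0/online
#   }
```

The example in the comments only re-runs the offline/online workaround described above when the cpus disagree; it is not a substitute for a kernel fix.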
SMT systems exposing multiple logical cpus per core, aka hyper-threading, don't visibly suffer from this issue, since the logical cpus of a core share the microcode. If cpu0 and cpu1 are part of the same core and cpu0, as the boot-cpu, doesn't receive the microcode update directly, it will still be updated by way of cpu1 post-resume. This doesn't "solve" the problem, it just hides it on SMT systems. This behavior was confirmed by several people and I mention it here in case the behavior in the original post wasn't reproducible for someone.
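To tell whether a machine falls into the "hidden by SMT" case, one can look at the sibling list the kernel exposes in sysfs. A small sketch follows, assuming the usual topology/thread_siblings_list file; the helper name and the optional second argument (an alternate sysfs root, handy for testing) are my own invention.

```shell
#!/bin/sh
# Sketch only: returns 0 if the given cpu shares its core with another
# logical cpu, 1 if it is alone, 2 if the sysfs file cannot be read.
# has_smt_sibling and the root override are hypothetical, for illustration.
has_smt_sibling() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    list=$(cat "$root/$cpu/topology/thread_siblings_list" 2>/dev/null) || return 2
    case $list in
        *[,-]*) return 0 ;;   # e.g. "0,4" or "0-1": more than one sibling
        *)      return 1 ;;   # e.g. "0": no SMT sibling
    esac
}

# Example:
#   has_smt_sibling cpu0 && echo "cpu0 has an SMT sibling, bug may be masked"
```

If cpu0 has a sibling, the mixed TSX state from the original post may not show up even though the boot cpu is still skipped on resume.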
Hmm, so we're not updating on resume. Does the attached patch fix the issue? Thanks.
Created attachment 157641
test patch
With the patch the boot-cpu also receives the microcode-update after resume and everything seems to work as expected:

# cpuid | egrep '(HLE|RTM)'
      HLE hardware lock elision                = false
      RTM: restricted transactional memory     = false
      HLE hardware lock elision                = false
      RTM: restricted transactional memory     = false
      HLE hardware lock elision                = false
      RTM: restricted transactional memory     = false
      HLE hardware lock elision                = false
      RTM: restricted transactional memory     = false

# dmesg | grep microcode
[    0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[    0.105384] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[    0.125910] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[    0.146426] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[    0.328417] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[    0.328424] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[    0.328429] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[    0.328434] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[    0.328467] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
[   75.855695] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[   75.866809] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[   75.880812] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[   75.894756] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
Good, thanks for testing. I'll add your Tested-by: tag to the patch. Closing.