Bug 88001

Summary: Microcode-update not re-applied to boot-CPU after resume from suspend
Product: Power Management Reporter: alex.schnaidt
Component: Hibernation/Suspend Assignee: Borislav Petkov (bp)
Status: CLOSED CODE_FIX    
Severity: high CC: arekm, bp, evangelos, fweimer, marius, thomas, vapier
Priority: P1    
Hardware: All   
OS: Linux   
URL: https://bugs.archlinux.org/task/42689
Kernel Version: 3.17.2-1-ARCH Subsystem:
Regression: No Bisected commit-id:
Attachments: test patch

Description alex.schnaidt 2014-11-10 20:44:28 UTC
In the wake of the Intel TSX bug and the release of the corresponding microcode update, one issue cropped up after resume: the microcode update is not re-applied to cpu0. On SMP systems this leaves cpu0 with the TSX instructions enabled while the remaining CPUs have them disabled by the microcode update.

# dmesg | grep microcode
[ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.093040] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.113650] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.134293] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
[ 0.345610] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345615] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345620] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345623] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
[ 0.345654] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
<SUSPEND>
[ 79.131245] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
[ 79.145367] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
[ 79.159411] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
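
A log like the one above can be checked mechanically. The following is a minimal sketch, not part of the original report: the function name skipped_cpus is hypothetical, and it assumes the "CPUn microcode updated early" message format shown here. It counts those lines per CPU and prints every CPU that logged fewer of them than the CPU with the most.

```shell
# Hypothetical helper: list CPUs that logged fewer "updated early" lines
# than the CPU with the most such lines. Reads kernel log text on stdin.
skipped_cpus() {
    awk '
        match($0, /CPU[0-9]+ microcode updated early/) {
            cpu = substr($0, RSTART, RLENGTH)
            sub(/ .*/, "", cpu)              # keep only "CPUn"
            count[cpu]++
            if (count[cpu] > max) max = count[cpu]
        }
        END {
            for (cpu in count)
                if (count[cpu] < max)
                    print cpu, count[cpu] "/" max
        }
    ' | sort
}
```

Run as "dmesg | skipped_cpus" after one suspend/resume cycle. Empty output means every CPU logged the same number of early updates.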

dmesg confirms that cpu0 doesn't receive the update on resume, but the kernel is unaware of this and still reports the same microcode version for all CPUs (or cores):

  # cat /sys/devices/system/cpu/cpu*/microcode/version
  0x1c
  0x1c
  0x1c
  0x1c 

Furthermore, /proc/cpuinfo reports that the TSX instructions HLE and RTM are still disabled.
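
For reference, the kernel's view can be summarised per CPU with a sketch like the following. The function name cpuinfo_tsx is hypothetical, and it assumes the usual x86 /proc/cpuinfo layout with "processor" and "flags" lines and the flag names hle and rtm. Note that this only shows what the kernel believes, which according to this report does not change after resume.

```shell
# Hypothetical helper: print, for each processor entry in
# /proc/cpuinfo-style text on stdin, whether the kernel lists hle and rtm.
cpuinfo_tsx() {
    awk -F: '
        /^processor/ { id = $2; gsub(/[ \t]/, "", id) }
        /^flags/ {
            hle = ($2 ~ /(^| )hle( |$)/) ? 1 : 0
            rtm = ($2 ~ /(^| )rtm( |$)/) ? 1 : 0
            print "cpu" id, "hle=" hle, "rtm=" rtm
        }
    '
}
```

Run as "cpuinfo_tsx < /proc/cpuinfo".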

Probing the capabilities directly pre-suspend:

  # cpuid | grep HLE
  HLE hardware lock elision = false
  HLE hardware lock elision = false
  HLE hardware lock elision = false
  HLE hardware lock elision = false

And post-resume:

  # cpuid | grep HLE
  HLE hardware lock elision = true
  HLE hardware lock elision = false
  HLE hardware lock elision = false
  HLE hardware lock elision = false
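
The mismatch above can be turned into a pass/fail check. This is a minimal sketch under the assumption that the cpuid tool prints one "name = value" line per CPU, as in the output shown; the function name flags_consistent is hypothetical.

```shell
# Hypothetical helper: succeed only if every input line carries the same
# value after the "=" sign. Reads lines such as
# "HLE hardware lock elision = false" on stdin.
flags_consistent() {
    # Extract the value after the last "=", then count distinct values.
    distinct=$(( $(sed -n 's/.*=[[:space:]]*\([a-z]*\)[[:space:]]*$/\1/p' \
                   | sort -u | wc -l) ))
    [ "$distinct" -le 1 ]
}
```

Run as "cpuid | grep HLE | flags_consistent || echo mismatch". Empty input counts as consistent, so the check says nothing on a system where cpuid prints no HLE lines.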

Since glibc doesn't rely on the kernel, but probes the instructions directly, the following can happen:

a) An application is started and the glibc code for instruction-detection is run on cpu0
b) TSX is detected and the codepaths using HLE/RTM go live
c) The application's process is moved to another cpu (with TSX disabled)
d) The process receives SIGILL in libpthread and is terminated

Bringing cpu0 off- and back online, effectively re-applying the update, gets the system back into a usable state:

  # echo 0 > /sys/devices/system/cpu/cpu0/online
  # echo 1 > /sys/devices/system/cpu/cpu0/online

  # cpuid | grep HLE
  HLE hardware lock elision = false
  HLE hardware lock elision = false
  HLE hardware lock elision = false
  HLE hardware lock elision = false
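
The workaround above could be wrapped in a small function, for example to run it from a resume hook. This is only a sketch: the name cycle_cpu and its optional second argument (an alternative sysfs directory, useful for dry runs) are assumptions, and it relies on the kernel exposing a writable "online" file for the CPU in question, as it does on the reporter's system.

```shell
# Hypothetical helper: take a CPU offline and bring it back online.
# $1 = CPU number, $2 = sysfs CPU directory (defaults to the real one).
cycle_cpu() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    ctl="$root/cpu$cpu/online"
    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl" >&2
        return 1
    fi
    # Only bring the CPU back online if taking it offline succeeded.
    echo 0 > "$ctl" && echo 1 > "$ctl"
}
```

Run as root: "cycle_cpu 0".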


Since cpu0 is special-cased on suspend and doesn't follow the regular hotplug routine, it does not seem to receive the microcode update on resume.

Downstream report:
  https://bugs.archlinux.org/task/42689
  https://bbs.archlinux.org/viewtopic.php?pid=1472766
Comment 1 alex.schnaidt 2014-11-12 15:12:20 UTC
SMT systems exposing multiple logical cpus, aka hyper-threading, don't suffer from this issue since all logical cpus share the microcode.

If cpu0 and cpu1 are part of the same core/cpu and cpu0, as the boot-cpu, doesn't receive the microcode update directly, it will still be updated by way of cpu1 post-resume. This doesn't "solve" the problem, it just hides it on SMT systems.

This behavior was confirmed by several people, and I mention it here in case the behavior in the original post wasn't reproducible for someone.
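
To tell whether a given system falls into this masked case, one can check whether cpu0 shares its core with another logical CPU. The sketch below assumes the sysfs file topology/thread_siblings_list; the function name has_smt_sibling and the optional directory argument are hypothetical.

```shell
# Hypothetical helper: succeed if the given CPU shares its core with at
# least one other logical CPU.
# $1 = CPU number, $2 = sysfs CPU directory (defaults to the real one).
# Returns 0 if a sibling exists, 1 if not, 2 if the file cannot be read.
has_smt_sibling() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    list=$(cat "$root/cpu$cpu/topology/thread_siblings_list" 2>/dev/null) || return 2
    # A lone CPU is listed as a single number; siblings show up as a
    # comma-separated list or a range, e.g. "0,4" or "0-1".
    case $list in
        *[,-]*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Run as "has_smt_sibling 0 && echo 'cpu0 has an SMT sibling, the issue may be hidden'".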
Comment 2 Borislav Petkov 2014-11-14 19:27:25 UTC
Hmm, so we're not updating on resume. Does the attached patch fix the issue?

Thanks.
Comment 3 Borislav Petkov 2014-11-14 19:28:05 UTC
Created attachment 157641
test patch
Comment 4 alex.schnaidt 2014-11-14 21:11:23 UTC
With the patch the boot-cpu also receives the microcode-update after resume and everything seems to work as expected:

  # cpuid | egrep '(HLE|RTM)'
      HLE hardware lock elision                = false
      RTM: restricted transactional memory     = false
      HLE hardware lock elision                = false
      RTM: restricted transactional memory     = false
      HLE hardware lock elision                = false
      RTM: restricted transactional memory     = false
      HLE hardware lock elision                = false
      RTM: restricted transactional memory     = false

  # dmesg | grep microcode
  [    0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
  [    0.105384] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
  [    0.125910] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
  [    0.146426] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
  [    0.328417] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c
  [    0.328424] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c
  [    0.328429] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c
  [    0.328434] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c
  [    0.328467] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
  <SUSPEND>
  [   75.855695] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03
  [   75.866809] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03
  [   75.880812] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03
  [   75.894756] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03
Comment 5 Borislav Petkov 2014-11-14 21:24:40 UTC
Good, thanks for testing. I'll add your Tested-by: tag to the patch.

Closing.