Bug 215841 - MSR_TSX_FORCE_ABORT not present during resume
Summary: MSR_TSX_FORCE_ABORT not present during resume
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64 (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-04-15 18:52 UTC by kyle.d.pelton
Modified: 2022-05-14 10:33 UTC (History)
4 users (show)

See Also:
Kernel Version: v5.18-rc2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dell-7380 cml system call trace after s3 resume (261.38 KB, text/plain)
2022-04-15 18:52 UTC, kyle.d.pelton
Details
whl_u dmesg (327.55 KB, text/plain)
2022-04-15 20:43 UTC, kyle.d.pelton
Details
dell-p3520 dmesg (299.09 KB, text/plain)
2022-04-15 20:43 UTC, kyle.d.pelton
Details
untested test patch (2.74 KB, patch)
2022-04-18 10:25 UTC, Borislav Petkov
Details | Diff

Description kyle.d.pelton 2022-04-15 18:52:54 UTC
Created attachment 300764 [details]
dell-7380 cml system call trace after s3 resume

The Call trace is seen after resume from S3 after commit  e2a1256b17b16f9b9adf1b6fea56819e7b68e463 on v5.18-rc2.

It reproduced on 3 of our systems. Ive attached the log for the dell 7380 cmlu log.

[   28.703345] unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) at rIP: 0xffffffff9b2819e4 (native_write_msr+0x4/0x20)
[   28.703357] Call Trace:
[   28.703359]  <TASK>
[   28.703361]  ? restore_processor_state+0x255/0x2d0
[   28.703369]  x86_acpi_suspend_lowlevel+0x11f/0x170
[   28.703374]  acpi_suspend_enter+0x4f/0x1f0
[   28.703379]  suspend_devices_and_enter+0x6e0/0x7d0
[   28.703384]  pm_suspend.cold+0x35c/0x3a7
[   28.703388]  state_store+0x81/0xe0
[   28.703392]  kobj_attr_store+0x12/0x20
[   28.703396]  sysfs_kf_write+0x3f/0x50
[   28.703399]  kernfs_fop_write_iter+0x13b/0x1d0
[   28.703403]  new_sync_write+0x101/0x180
[   28.703408]  vfs_write+0x217/0x2a0
[   28.703413]  ksys_write+0x67/0xe0
[   28.703417]  __x64_sys_write+0x1a/0x20
[   28.703421]  do_syscall_64+0x3b/0x90
[   28.703426]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   28.703429] RIP: 0033:0x7fda13c260a7
[   28.703434] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   28.703437] RSP: 002b:00007fffa4060268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   28.703441] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fda13c260a7
[   28.703443] RDX: 0000000000000004 RSI: 000055a41f65a570 RDI: 0000000000000004
[   28.703444] RBP: 000055a41f65a570 R08: 0000000000000000 R09: 0000000000000004
[   28.703446] R10: 000055a41f0cc2a6 R11: 0000000000000246 R12: 0000000000000004
[   28.703447] R13: 000055a41f657510 R14: 00007fda13d014a0 R15: 00007fda13d008a0
[   28.703451]  </TASK>
[   28.703454] Call Trace:
[   28.703454]  <TASK>
[   28.703455]  ? restore_processor_state+0x255/0x2d0
[   28.703459]  x86_acpi_suspend_lowlevel+0x11f/0x170
[   28.703463]  acpi_suspend_enter+0x4f/0x1f0
[   28.703468]  suspend_devices_and_enter+0x6e0/0x7d0
[   28.703471]  pm_suspend.cold+0x35c/0x3a7
[   28.703475]  state_store+0x81/0xe0
[   28.703477]  kobj_attr_store+0x12/0x20
[   28.703480]  sysfs_kf_write+0x3f/0x50
[   28.703483]  kernfs_fop_write_iter+0x13b/0x1d0
[   28.703486]  new_sync_write+0x101/0x180
[   28.703491]  vfs_write+0x217/0x2a0
[   28.703495]  ksys_write+0x67/0xe0
[   28.703499]  __x64_sys_write+0x1a/0x20
[   28.703503]  do_syscall_64+0x3b/0x90
[   28.703506]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   28.703509] RIP: 0033:0x7fda13c260a7
[   28.703511] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   28.703514] RSP: 002b:00007fffa4060268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   28.703516] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fda13c260a7
[   28.703518] RDX: 0000000000000004 RSI: 000055a41f65a570 RDI: 0000000000000004
[   28.703519] RBP: 000055a41f65a570 R08: 0000000000000000 R09: 0000000000000004
[   28.703520] R10: 000055a41f0cc2a6 R11: 0000000000000246 R12: 0000000000000004
[   28.703522] R13: 000055a41f657510 R14: 00007fda13d014a0 R15: 00007fda13d008a0
[   28.703525]  </TASK>
Comment 1 kyle.d.pelton 2022-04-15 19:00:19 UTC
Reverting the commit fixes the issue.
Comment 2 Borislav Petkov 2022-04-15 19:21:16 UTC
0x10f is MSR_TSX_FORCE_ABORT. And that's weird because if that MSR is not there, it should not be marked as valid and we should not write it on resume. Unless the microcode is playing tricks on us again.

Are you updating microcode before suspending?

Are you willing to run a debug patch and send me full dmesg?

Thx.
Comment 3 kyle.d.pelton 2022-04-15 20:43:25 UTC
Created attachment 300765 [details]
whl_u dmesg
Comment 4 kyle.d.pelton 2022-04-15 20:43:46 UTC
Created attachment 300766 [details]
dell-p3520 dmesg
Comment 5 kyle.d.pelton 2022-04-15 20:44:34 UTC
No we are not updating the microcode. I'll run debug patches on the system if you have one.
Comment 6 Pawan Gupta 2022-04-15 23:09:58 UTC
This CPU does support the MSR_TSX_FORCE_ABORT as enumerated by CPUID.7.0.EDX[13]=1.

# cpuid -r -1 -l 7
CPU:
   0x00000007 0x00: eax=0x00000000 ebx=0x029c67af ecx=0x00000000 edx=0xbc002e00

# rdmsr -a 0x10f
0
0
0
0
0
0
0
0

As the failing MSR is supported by the system but likely the microcode has not made it available yet at MSR restore location during resume.

It is important to restore these MSRs. A possible workaround is to  restore them at a later point in resume process. Working on it.
Comment 7 Borislav Petkov 2022-04-15 23:31:37 UTC
Looking at his dmesg, it has, in that order:

1. the warn from the failed MSR write down the x86_acpi_suspend_lowlevel() path where msr_restore_context() happens, in __restore_processor_state().

2. a bit later:

...
[   28.704447] PM: Calling mce_syscore_resume+0x0/0x30
[   28.704481] PM: Calling mc_bp_resume+0x0/0x50

and this is where the microcode on the BSP is reloaded.

So, if I can trust the timestamps, I guess the microcode reloading should happen a lot earlier, somewhere up in __restore_processor_state()...
Comment 8 Pawan Gupta 2022-04-16 06:54:01 UTC
On Fri, Apr 15, 2022 at 11:31:37PM +0000, bugzilla-daemon@kernel.org wrote:
>https://bugzilla.kernel.org/show_bug.cgi?id=215841
>
>--- Comment #7 from Borislav Petkov (bp@alien8.de) ---
>Looking at his dmesg, it has, in that order:
>
>1. the warn from the failed MSR write down the x86_acpi_suspend_lowlevel()
>path
>where msr_restore_context() happens, in __restore_processor_state().
>
>2. a bit later:
>
>...
>[   28.704447] PM: Calling mce_syscore_resume+0x0/0x30
>[   28.704481] PM: Calling mc_bp_resume+0x0/0x50
>
>and this is where the microcode on the BSP is reloaded.
>
>So, if I can trust the timestamps, I guess the microcode reloading should
>happen a lot earlier, somewhere up in __restore_processor_state()...

Yes, microcode is getting loaded after __restore_processor_state().

Confirmed with a debug build that a call to mc_bp_resume() just before
msr_restore_context() fixes the issue.

Not sure if this is a right solution because now microcode loading will
be attempted twice in resume path. Not registering mc_bp_resume() with
syscore is also not an option as its needed for other use cases:

    1   1188  kernel/kexec_core.c <<kernel_kexec>>
              syscore_resume();
    2    337  kernel/power/hibernate.c <<create_image>>
              syscore_resume();
    3    506  kernel/power/hibernate.c <<resume_target_kernel>>
              syscore_resume();
    4    608  kernel/power/hibernate.c <<hibernation_platform_enter>>
              syscore_resume();
    5    444  kernel/power/suspend.c <<suspend_enter>>
              syscore_resume();
Comment 9 Borislav Petkov 2022-04-16 13:21:17 UTC
(In reply to Pawan Gupta from comment #8)
> Not sure if this is a right solution because now microcode loading will
> be attempted twice in resume path.

Twice is not the problem - it checks the patchlevel before applying and the second time won't do anything.

>  Not registering mc_bp_resume() with
> syscore is also not an option as its needed for other use cases:

Which means that MSR restore must depend on microcode resume because some of the MSRs are "emulated" by microcode.

Which then means, the MSR restoring must happen *every time* mc_bp_resume() runs and *after* it has ran.

Unfortunately, there's no way such a dependency can be expressed with the syscore functionality. At least I don't see it.

Alternatively, we could do a callback similar to microcode_check() where we run msr_restore_context()...
Comment 10 Pawan Gupta 2022-04-18 05:33:11 UTC
On Sat, Apr 16, 2022 at 01:21:17PM +0000, bugzilla-daemon@kernel.org wrote:
>https://bugzilla.kernel.org/show_bug.cgi?id=215841
>
>--- Comment #9 from Borislav Petkov (bp@alien8.de) ---
>(In reply to Pawan Gupta from comment #8)
>> Not sure if this is a right solution because now microcode loading will
>> be attempted twice in resume path.
>
>Twice is not the problem - it checks the patchlevel before applying and the
>second time won't do anything.
>
>>  Not registering mc_bp_resume() with
>> syscore is also not an option as its needed for other use cases:
>
>Which means that MSR restore must depend on microcode resume because some of
>the MSRs are "emulated" by microcode.
>
>Which then means, the MSR restoring must happen *every time* mc_bp_resume()
>runs and *after* it has ran.
>
>Unfortunately, there's no way such a dependency can be expressed with the
>syscore functionality. At least I don't see it.

Right, IIUC syscore would resume in the order entities were registered,
which can't be relied on when we have to maintain strict ordering.

>Alternatively, we could do a callback similar to microcode_check() where we
>run
>msr_restore_context()...

Hmm, when CONFIG_MICROCODE=n there needs to be a fallback path for
msr_restore_context(). For cases when microcode support is disabled, we
would still need to restore the MSRs.

I believe microcode loading should be done pretty early, and the right
place to load it is during restore_processor_context(). If I am not
mistaken, syscore_resume() is always by preceded by
restore_processor_context(). I am more inclined towards your previous
suggestion of loading the microcode in restore_processor_context().

Adding Ashok Raj and Rafael Wysocki if they have any opinion on this.
Comment 11 Borislav Petkov 2022-04-18 10:25:04 UTC
I guess we could do something like the attached and untested patch. Untested because I need to bisect a triplefault on current tip/master first. ;-\
Comment 12 Borislav Petkov 2022-04-18 10:25:57 UTC
Created attachment 300768 [details]
untested test patch

update microcode after processor state has been restored
Comment 13 Pawan Gupta 2022-04-18 18:56:30 UTC
On Mon, Apr 18, 2022 at 10:25:04AM +0000, bugzilla-daemon@kernel.org wrote:
>https://bugzilla.kernel.org/show_bug.cgi?id=215841
>
>--- Comment #11 from Borislav Petkov (bp@alien8.de) ---
>I guess we could do something like the attached and untested patch.

Similar to your test patch I earlier tested below change to be working:

@@ -262,6 +264,7 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
  	x86_platform.restore_sched_clock_state();
  	mtrr_bp_restore();
  	perf_restore_debug_store();
+	mc_bp_resume();
  	msr_restore_context(ctxt);
  
  	c = &cpu_data(smp_processor_id());

I will share the test build with your patch for Kyle to also verify.

Note You need to log in before you can comment on or make changes to this bug.