Created attachment 307940 [details]
latest.config

S4 has started to hang in resume on 20 of our machines and was bisected to this commit. I've tested 6.15.0-rc1 with this one commit reverted and the problem goes away. It also depends on a CONFIG option I've yet to nail down, so I've attached the entire kernel config I use.

commit 582077c94052bd69a544b3f9d7619c9c6a67c34b (HEAD, refs/bisect/bad)
Author: Peter Zijlstra <peterz@infradead.org>
Date: Fri Feb 7 13:15:33 2025 +0100

    x86/cfi: Clean up linkage

    With the introduction of kCFI the addition of ENDBR to
    SYM_FUNC_START* no longer suffices to make the function indirectly
    callable. This now requires the use of SYM_TYPED_FUNC_START.

    As such, remove the implicit ENDBR from SYM_FUNC_START* and add some
    explicit annotations to fix things up again.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
    Link: https://lore.kernel.org/r/20250207122546.409116003@infradead.org

The error is reproduced by building 6.15-rc1 with this config & running disk:
%> sudo sleepgraph -m disk -rtcwake 60

It will eventually reboot and reload the hibernate image, but it will hang right after this in console:

[ 12.127821] PM: Image loading progress: 50%
[ 12.354286] PM: Image loading progress: 60%
[ 12.586796] PM: Image loading progress: 70%
[ 12.900275] PM: Image loading progress: 80%
[ 13.117556] PM: Image loading progress: 90%
[ 13.287038] PM: Image loading progress: 100%
[ 13.291336] PM: Image loading done
[ 13.294746] PM: hibernation: Read 2156828 kbytes in 2.78 seconds (775.83 MB/s)
[ 13.302338] printk: Suspending console(s) (use no_console_suspend to debug)

Note that setting /sys/power/disk to test_resume and running disk results in the same failure. So the issue is in the resume call which loads the hibernate image.
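The test_resume variant mentioned in the report can also be driven through sysfs directly, without sleepgraph. The following is only a minimal sketch, assuming root privileges and a working resume device; the helper name `hibernate_test_resume` and its `dry_run` parameter are made up for illustration and are not part of sleepgraph or the kernel.

```python
from pathlib import Path

# sysfs control files for hibernation (the first one is named in the report)
DISK_MODE = Path("/sys/power/disk")
STATE = Path("/sys/power/state")


def hibernate_test_resume(dry_run=True):
    """Return the sysfs writes for one test_resume hibernation cycle.

    Hypothetical helper: with dry_run=False the writes are really
    performed (root needed) and the machine goes through the image
    load and restore path that hangs in this bug.
    """
    steps = [
        (DISK_MODE, "test_resume"),  # select the diagnostic resume mode
        (STATE, "disk"),             # start the hibernation transition
    ]
    if not dry_run:
        for path, value in steps:
            path.write_text(value)
    return steps
```

Called with the default `dry_run=True`, the function only returns the two writes, so it is safe to inspect the sequence on a machine you do not want to hang.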
Created attachment 307941 [details]
final screen of death

Reproduced on a Dell XPS-13 9310 and 9315.

With "no_console_suspend" enabled and a movie taken of the failure, RIP:: 0001:ex_control_protection+2c4/0x2d0 is visible for an instant before it gets obscured by subsequent stack traces, ending with a timekeeping warning and stack trace that fills the screen.
On Wed, Apr 09, 2025 at 02:45:43PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219998
>
> Bug ID: 219998
> Summary: [BISECTED] 6.15.0-rc1 x86 patch breaks S4 hibernate on
> 20 machines
> Product: Platform Specific/Hardware
> Version: 2.5
> Kernel Version: 6.15.0-rc1
> Hardware: Intel
> OS: Linux
> Status: NEW
> Severity: high
> Priority: P3
> Component: x86-64
> Assignee: platform_x86_64@kernel-bugs.osdl.org
> Reporter: todd.e.brandt@intel.com
> CC: a.p.zijlstra@chello.nl
> Blocks: 178231
> Regression: Yes
> Bisected 582077c94052bd69a544b3f9d7619c9c6a67c34b
> commit-id:
>
> Created attachment 307940 [details]
> --> https://bugzilla.kernel.org/attachment.cgi?id=307940&action=edit
> latest.config
>
> S4 has started to hang in resume on 20 of our machines and was bisected to
> this
> commit. I've tested 6.15.0-rc1 with this one commit reverted and the problem
> goes away. It also depends on a CONFIG option I've yet to nail down, so I've
> attached the entire kernel config I use.
>
> commit 582077c94052bd69a544b3f9d7619c9c6a67c34b (HEAD, refs/bisect/bad)
> Author: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Feb 7 13:15:33 2025 +0100
>
> x86/cfi: Clean up linkage
>
> With the introduction of kCFI the addition of ENDBR to
> SYM_FUNC_START* no longer suffices to make the function indirectly
> callable. This now requires the use of SYM_TYPED_FUNC_START.
>
> As such, remove the implicit ENDBR from SYM_FUNC_START* and add some
> explicit annotations to fix things up again.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
> Link: https://lore.kernel.org/r/20250207122546.409116003@infradead.org
>
> The error is reproduced by building 6.15-rc1 with this config & running disk:
> %> sudo sleepgraph -m disk -rtcwake 60
>
> It will eventually reboot and reload the hibernate image, but it will hang
> right after this in console:
>
> [ 12.127821] PM: Image loading progress: 50%
> [ 12.354286] PM: Image loading progress: 60%
> [ 12.586796] PM: Image loading progress: 70%
> [ 12.900275] PM: Image loading progress: 80%
> [ 13.117556] PM: Image loading progress: 90%
> [ 13.287038] PM: Image loading progress: 100%
> [ 13.291336] PM: Image loading done
> [ 13.294746] PM: hibernation: Read 2156828 kbytes in 2.78 seconds (775.83
> MB/s)
> [ 13.302338] printk: Suspending console(s) (use no_console_suspend to
> debug)
>
> Note that setting /sys/power/disk to test_resume and running disk results in
> the same failure. So the issue is in the resume call which loads the
> hibernate
> image.

Given the .config states it is a GCC build and this patch makes a difference, I'm guessing all those 20 machines support CET and have IBT on.

Given the utter lack of useful output, I'm guessing we're tripping #CP before we have an IDT entry set up.

Looking at that patch, there's a hunk in hibernate_asm_64.S that might or might not be relevant.

Does this help?
---
diff --git a/arch/x86/power/hibernate_asm_64.S b/arch/x86/power/hibernate_asm_64.S
index 8c534c36adfa..66f066b8feda 100644
--- a/arch/x86/power/hibernate_asm_64.S
+++ b/arch/x86/power/hibernate_asm_64.S
@@ -26,7 +26,7 @@
 /* code below belongs to the image kernel */
 	.align PAGE_SIZE
 SYM_FUNC_START(restore_registers)
-	ANNOTATE_NOENDBR
+	ENDBR
 	/* go back to the original page tables */
 	movq %r9, %cr3
 
@@ -120,7 +120,7 @@ SYM_FUNC_END(restore_image)
 
 /* code below has been relocated to a safe page */
 SYM_FUNC_START(core_restore_code)
-	ANNOTATE_NOENDBR
+	ENDBR
 	/* switch to temporary page tables */
 	movq %rax, %cr3
 	/* flush TLB */
On Wed, Apr 09, 2025 at 03:10:35PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219998
>
> Len Brown (lenb@kernel.org) changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |lenb@kernel.org
>
> --- Comment #1 from Len Brown (lenb@kernel.org) ---
> Created attachment 307941 [details]
> --> https://bugzilla.kernel.org/attachment.cgi?id=307941&action=edit
> final screen of death
>
> Reproduced on a Dell XPS-13 9310, and 9315.
>
> enabling "no_console_suspend" and taking a movie of the failure,
> RIP:: 0001:ex_control_protection+2c4/0x2d0

#CP all right.

> is visible for an instant, before it gets obscured by subsequent
> stack traces, ending with a by a timekeeping warning and stack trace that
> fills
> the screen.

And I'm guessing you don't have serial output? Does that machine have vPro on? In which case you can probably try an AMT console (meshcmd AmtTerm --host foo --pass bar)

The thing that is interesting is where that exception comes from, so the stack trace leading up to it.
Created attachment 307942 [details]
console-log-for-mtl-m-machine-fail.txt

I reran the issue on our otcpl-mtl-m-1 machine which has a console, bumped the log level to 8, and added no_console_suspend. There's an actual bug shown here:

[ 13.969307] ------------[ cut here ]------------
[ 13.973902] kernel BUG at arch/x86/kernel/cet.c:132!
[ 13.978841] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 13.983869] CPU: 0 UID: 0 PID: 298 Comm: resume Not tainted 6.15.0-rc1-dirty #1 PREEMPT(voluntary)
[ 13.992855] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-M LP5x CONF1 RVP, BIOS M>
[ 14.005634] RIP: 0010:exc_control_protection+0x21f/0x230
This appears to be the bug, I'll try it on another architecture to be sure it's the same:

[ 13.521918] Disabling non-boot CPUs ...
[ 13.571722] smpboot: CPU 13 is now offline
[ 13.619850] smpboot: CPU 12 is now offline
[ 13.659415] smpboot: CPU 11 is now offline
[ 13.691466] smpboot: CPU 10 is now offline
[ 13.731453] smpboot: CPU 9 is now offline
[ 13.763451] smpboot: CPU 8 is now offline
[ 13.787430] smpboot: CPU 7 is now offline
[ 13.827464] smpboot: CPU 6 is now offline
[ 13.851626] smpboot: CPU 5 is now offline
[ 13.875556] smpboot: CPU 4 is now offline
[ 13.907381] smpboot: CPU 3 is now offline
[ 13.931328] smpboot: CPU 2 is now offline
[ 13.955465] smpboot: CPU 1 is now offline
[ 13.964964] Missing ENDBR: 0xffff8b8d1623f000
[ 13.969307] ------------[ cut here ]------------
[ 13.973902] kernel BUG at arch/x86/kernel/cet.c:132!
[ 13.978841] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 13.983869] CPU: 0 UID: 0 PID: 298 Comm: resume Not tainted 6.15.0-rc1-dirty #1 PREEMPT(voluntary)
[ 13.992855] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-M LP5x CONF1 RVP, BIOS MTLMFWI1.R00.3424.D83.2310270500 10/27/2023
[ 14.005634] RIP: 0010:exc_control_protection+0x21f/0x230
[ 14.010921] Code: c7 49 75 cc 9a e8 41 71 06 ff 48 83 c4 18 e9 6e fe ff ff 41 80 a4 24 8a 00 00 00 fb 49 c7 44 24 50 00 00 00 00 e9 99 fe ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90 90 90
[ 14.029548] RSP: 0018:ffffa93400b47ba0 EFLAGS: 00010002
[ 14.034746] RAX: 0000000000000021 RBX: 0000000000000000 RCX: 0000000000000000
[ 14.041837] RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: 00000000ffffffff
[ 14.048927] RBP: ffffa93400b47bc8 R08: 0000000000000000 R09: ffffa93400b47a20
[ 14.056013] R10: ffffa93400b47a18 R11: ffffffff9b161948 R12: ffffa93400b47bd8
[ 14.063104] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[ 14.070197] FS: 00007f6ad7ba6740(0000) GS:ffff8b90d3d90000(0000) knlGS:0000000000000000
[ 14.078232] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.083949] CR2: 00005586f88e4098 CR3: 0000000101878005 CR4: 0000000000f70ef0
[ 14.091042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 14.098128] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 14.105218] PKRU: 55555554
[ 14.107914] Call Trace:
[ 14.110359] <TASK>
[ 14.112453] asm_exc_control_protection+0x2b/0x30
[ 14.117133] RIP: 0010:0xffff8b8d1623f000
[ 14.121035] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <0f> 22 d8 48 89 d9 48 81 e1 7f ff ff ff 0f 22 e1 0f 20 d9 0f 22 d9
[ 14.139665] RSP: 0018:ffffa93400b47c80 EFLAGS: 00010046
[ 14.144858] RAX: 0000000115d48000 RBX: 00000000000000f0 RCX: ffff8b8d1623f000
[ 14.151946] RDX: ffff8b8d3c030878 RSI: 00000001162001e3 RDI: ffff8b8d01e4c588
[ 14.159032] RBP: ffffa93400b47cf8 R08: ffffffff99b88010 R09: 00000001013d4000
[ 14.166122] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 14.173214] R13: 0000000000000003 R14: ffff8b8d96280000 R15: 0000000000000063
[ 14.180304] ? destroy_buffers+0xb0/0xb0
[ 14.184212] ? swsusp_arch_resume+0x2ad/0x640
[ 14.188544] ? __pfx_alloc_pgt_page+0x10/0x10
[ 14.192878] ? hibernation_restore+0x130/0x160
[ 14.197301] ? load_image_and_restore+0xa9/0xd0
[ 14.201807] ? software_resume+0x20e/0x2a0
[ 14.205882] ? resume_store+0xf1/0x210
[ 14.209617] ? preempt_count_add+0x52/0xd0
[ 14.213695] ? kobj_attr_store+0x13/0x30
[ 14.217602] ? sysfs_kf_write+0x73/0x90
[ 14.221426] ? kernfs_fop_write_iter+0x13a/0x1c0
[ 14.226021] ? vfs_write+0x322/0x440
[ 14.229585] ? ksys_write+0x6d/0xe0
[ 14.233061] ? __x64_sys_write+0x1d/0x30
[ 14.236964] ? x64_sys_call+0x16ba/0x2150
[ 14.240953] ? do_syscall_64+0x51/0x6f0
[ 14.244770] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 14.249964] </TASK>
[ 14.252144] Modules linked in: cdc_ether usbnet r8152 intel_lpss_pci ucsi_acpi intel_lpss thunderbolt mii typec_ucsi idma64 virt_dma typec video pinctrl_meteorlake pinctrl_intel wmi pwm_lpss
[ 14.268972] ---[ end trace 0000000000000000 ]---
On Wed, Apr 09, 2025 at 03:31:28PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219998
>
> --- Comment #4 from Todd Brandt (todd.e.brandt@intel.com) ---
> Created attachment 307942 [details]
> --> https://bugzilla.kernel.org/attachment.cgi?id=307942&action=edit
> console-log-for-mtl-m-machine-fail.txt
>
> I reran the issue on our otcpl-mtl-m-1 machine which has a console, bumped
> the
> log level to 8, and added no_console_suspend. There's an actual bug shown
> here:
>
> [ 13.969307] ------------[ cut here ]------------
> [ 13.973902] kernel BUG at arch/x86/kernel/cet.c:132!
> [ 13.978841] Oops: invalid opcode: 0000 [#1] SMP NOPTI
> [ 13.983869] CPU: 0 UID: 0 PID: 298 Comm: resume Not tainted
> 6.15.0-rc1-dirty
> #1 PREEMPT(voluntary)
> [ 13.992855] Hardware name: Intel Corporation Meteor Lake Client
> Platform/MTL-M LP5x CONF1 RVP, BIOS M>
> [ 14.005634] RIP: 0010:exc_control_protection+0x21f/0x230

Excellent!

[ 13.964964] Missing ENDBR: 0xffff8b8d1623f000
[ 13.969307] ------------[ cut here ]------------
[ 13.973902] kernel BUG at arch/x86/kernel/cet.c:132!
[ 13.978841] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 13.983869] CPU: 0 UID: 0 PID: 298 Comm: resume Not tainted 6.15.0-rc1-dirty #1 PREEMPT(voluntary)
[ 13.992855] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-M LP5x CONF1 RVP, BIOS MTLMFWI1.R00.3424.D83.2310270500 10/27/2023
[ 14.005634] RIP: 0010:exc_control_protection+0x21f/0x230
[ 14.010921] Code: c7 49 75 cc 9a e8 41 71 06 ff 48 83 c4 18 e9 6e fe ff ff 41 80 a4 24 8a 00 00 00 fb 49 c7 44 24 50 00 00 00 00 e9 99 fe ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90 90 90
[ 14.029548] RSP: 0018:ffffa93400b47ba0 EFLAGS: 00010002
[ 14.034746] RAX: 0000000000000021 RBX: 0000000000000000 RCX: 0000000000000000
[ 14.041837] RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: 00000000ffffffff
[ 14.048927] RBP: ffffa93400b47bc8 R08: 0000000000000000 R09: ffffa93400b47a20
[ 14.056013] R10: ffffa93400b47a18 R11: ffffffff9b161948 R12: ffffa93400b47bd8
[ 14.063104] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[ 14.070197] FS: 00007f6ad7ba6740(0000) GS:ffff8b90d3d90000(0000) knlGS:0000000000000000
[ 14.078232] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.083949] CR2: 00005586f88e4098 CR3: 0000000101878005 CR4: 0000000000f70ef0
[ 14.091042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 14.098128] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 14.105218] PKRU: 55555554
[ 14.107914] Call Trace:
[ 14.110359] <TASK>
[ 14.112453] asm_exc_control_protection+0x2b/0x30
[ 14.117133] RIP: 0010:0xffff8b8d1623f000
[ 14.121035] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <0f> 22 d8 48 89 d9 48 81 e1 7f ff ff ff 0f 22 e1 0f 20 d9 0f 22 d9
[ 14.139665] RSP: 0018:ffffa93400b47c80 EFLAGS: 00010046
[ 14.144858] RAX: 0000000115d48000 RBX: 00000000000000f0 RCX: ffff8b8d1623f000
[ 14.151946] RDX: ffff8b8d3c030878 RSI: 00000001162001e3 RDI: ffff8b8d01e4c588
[ 14.159032] RBP: ffffa93400b47cf8 R08: ffffffff99b88010 R09: 00000001013d4000
[ 14.166122] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 14.173214] R13: 0000000000000003 R14: ffff8b8d96280000 R15: 0000000000000063
[ 14.180304] ? destroy_buffers+0xb0/0xb0
[ 14.184212] ? swsusp_arch_resume+0x2ad/0x640
[ 14.188544] ? __pfx_alloc_pgt_page+0x10/0x10
[ 14.192878] ? hibernation_restore+0x130/0x160
[ 14.197301] ? load_image_and_restore+0xa9/0xd0
[ 14.201807] ? software_resume+0x20e/0x2a0
[ 14.205882] ? resume_store+0xf1/0x210
[ 14.209617] ? preempt_count_add+0x52/0xd0
[ 14.213695] ? kobj_attr_store+0x13/0x30
[ 14.217602] ? sysfs_kf_write+0x73/0x90
[ 14.221426] ? kernfs_fop_write_iter+0x13a/0x1c0
[ 14.226021] ? vfs_write+0x322/0x440
[ 14.229585] ? ksys_write+0x6d/0xe0
[ 14.233061] ? __x64_sys_write+0x1d/0x30
[ 14.236964] ? x64_sys_call+0x16ba/0x2150
[ 14.240953] ? do_syscall_64+0x51/0x6f0
[ 14.244770] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 14.249964] </TASK>

So while that stacktrace is a bit wonky, it points me to swsusp_arch_resume(), which calls restore_image(), which tries to do an indirect jump to relocated_restore_code, which per relocate_restore_code() is a copy of core_restore_code.

So that little patch I sent earlier, which does s/ANNOTATE_NOENDBR/ENDBR/ in hibernate_asm_64.S, and more specifically, the second hunk, which concerns core_restore_code, should cure this one.

Can you confirm?
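The "Missing ENDBR" diagnosis can be cross-checked against the oops itself: in a "Code:" line the byte wrapped in <> is the one the reported RIP points at, and an IBT-valid indirect branch target would have to start with the endbr64 encoding f3 0f 1e fa. The sketch below is illustrative only; the function names are made up, and the sample line is the faulting frame from the MTL-M log with most of the leading zero bytes trimmed.

```python
ENDBR64 = bytes.fromhex("f30f1efa")  # machine encoding of endbr64


def bytes_at_rip(code_line):
    """Return the bytes from the <xx> marker onward in an oops 'Code:' line."""
    tokens = code_line.split("Code:", 1)[1].split()
    # The token wrapped in <> marks the byte at the reported RIP.
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes.fromhex("".join(tok.strip("<>") for tok in tokens[start:]))


def starts_with_endbr(code_line):
    """True if the code at RIP begins with an endbr64 instruction."""
    return bytes_at_rip(code_line).startswith(ENDBR64)


# Faulting frame (RIP 0xffff8b8d1623f000); leading zero bytes trimmed here.
MTL_CODE = (
    "Code: 00 00 00 00 <0f> 22 d8 48 89 d9 48 81 e1 7f ff ff ff "
    "0f 22 e1 0f 20 d9 0f 22 d9"
)

if __name__ == "__main__":
    print(bytes_at_rip(MTL_CODE)[:3].hex(" "))
    print(starts_with_endbr(MTL_CODE))
```

The first three bytes at the faulting address, 0f 22 d8, encode `mov %rax,%cr3`, which matches the first instruction of core_restore_code in the patch context, with no endbr64 in front of it. That is consistent with the jump landing on the relocated copy of core_restore_code.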
I will test it in a few minutes, I've just verified the exact same bug occurred on the ADL-S. I will try building it with your fix. Also, any idea which config option would make this go away? Len's default config doesn't see this error and we differ on a lot of our CONFIG_X86 options.

otcpl-adl-s-5 console out:

[ 11.585547] smpboot: CPU 3 is now offline
[ 11.612323] smpboot: CPU 2 is now offline
[ 11.645814] smpboot: CPU 1 is now offline
[ 11.655547] Missing ENDBR: 0xffff9851553d3000
[ 11.659929] ------------[ cut here ]------------
[ 11.664563] kernel BUG at arch/x86/kernel/cet.c:132!
[ 11.669545] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 11.674615] CPU: 0 UID: 0 PID: 396 Comm: resume Not tainted 6.15.0-rc1-dirty #1 PREEMPT(voluntary)
[ 11.683668] Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-S ADP-S DDR4 UDIMM CRB, BIOS ADLSFWI1.R00.3192.A00.2205032149 05/03/2022
[ 11.697495] RIP: 0010:exc_control_protection+0x21f/0x230
[ 11.702838] Code: c7 49 75 8c b6 e8 41 71 06 ff 48 83 c4 18 e9 6e fe ff ff 41 80 a4 24 8a 00 00 00 fb 49 c7 44 24 50 00 00 00 00 e9 99 fe ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 90 90 90 90 90
[ 11.721608] RSP: 0018:ffffafdc80e83ba0 EFLAGS: 00010002
[ 11.726923] RAX: 0000000000000021 RBX: 0000000000000000 RCX: 0000000000000000
[ 11.734071] RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: 00000000ffffffff
[ 11.741216] RBP: ffffafdc80e83bc8 R08: 0000000000000000 R09: ffffafdc80e83a20
[ 11.748366] R10: ffffafdc80e83a18 R11: ffffffffb6d61948 R12: ffffafdc80e83bd8
[ 11.755516] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[ 11.762666] FS: 00007f5401656740(0000) GS:ffff985528190000(0000) knlGS:0000000000000000
[ 11.770991] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 11.777048] CR2: 0000560d6293e088 CR3: 0000000116578001 CR4: 0000000000f70ef0
[ 11.784482] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 11.791919] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 11.799355] PKRU: 55555554
[ 11.802367] Call Trace:
[ 11.805129] <TASK>
[ 11.807532] asm_exc_control_protection+0x2b/0x30
[ 11.812536] RIP: 0010:0xffff9851553d3000
[ 11.816754] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <0f> 22 d8 48 89 d9 48 81 e1 7f ff ff ff 0f 22 e1 0f 20 d9 0f 22 d9
[ 11.836044] RSP: 0018:ffffafdc80e83c80 EFLAGS: 00010046
[ 11.841594] RAX: 00000001026c1000 RBX: 00000000000000f0 RCX: ffff9851553d3000
[ 11.849043] RDX: ffff9851715c3d40 RSI: 00000001152001e3 RDI: ffff98514227b548
[ 11.856496] RBP: ffffafdc80e83cf8 R08: ffffffffa6b88010 R09: 000000012a2b2000
[ 11.863953] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000000
[ 11.871408] R13: 0000000000000003 R14: ffff9851d53e6000 R15: 0000000000000063
[ 11.878865] ? swsusp_arch_resume+0x2ad/0x640
[ 11.883550] ? __pfx_alloc_pgt_page+0x10/0x10
[ 11.888235] ? hibernation_restore+0x130/0x160
[ 11.893007] ? load_image_and_restore+0xa9/0xd0
[ 11.897866] ? software_resume+0x20e/0x2a0
[ 11.902284] ? resume_store+0xf1/0x210
[ 11.906360] ? preempt_count_add+0x52/0xd0
[ 11.910777] ? kobj_attr_store+0x13/0x30
[ 11.915023] ? sysfs_kf_write+0x73/0x90
[ 11.919188] ? kernfs_fop_write_iter+0x13a/0x1c0
[ 11.924136] ? vfs_write+0x322/0x440
[ 11.928051] ? ksys_write+0x6d/0xe0
[ 11.931869] ? __x64_sys_write+0x1d/0x30
[ 11.936120] ? x64_sys_call+0x16ba/0x2150
[ 11.940474] ? do_syscall_64+0x51/0x6f0
[ 11.944651] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 11.950207] </TASK>
[ 11.952733] Modules linked in: hid_sensor_custom hid_sensor_hub hid_generic intel_ishtp_hid hid spi_pxa2xx_platform dw_dmac dw_dmac_core 8250_dw spi_pxa2xx_core intel_lpss_pci e1000e intel_lpss intel_ish_ipc idma64 intel_ishtp virt_dma video pinctrl_alderlake pinctrl_intel wmi pwm_lpss
[ 11.978595] ---[ end trace 0000000000000000 ]---
Len believes it's CONFIG_X86_KERNEL_IBT. If it's set the failure occurs, if it's unset the failure does not. I've yet to test it myself. I'm building with your patch, will test in 20 minutes.
Re: patch in comment #2

Tested-by: Len Brown <len.brown@intel.com>

That patch makes my Dell XPS-13 9315 with CONFIG_X86_KERNEL_IBT able to hibernate again.
Annnnd... it's GOOD! Your patch seems to fix it on both the otcpl-mtl-m-1 and otcpl-adl-s-5. I'm now going to run an hour block of back to back S4 hibernates on all 60 of our machines. I want to be sure it fixes it on the other 18 that failed and causes no new problems on the unaffected systems. I'll update when the stress test is done. Thanks!
On Wed, Apr 09, 2025 at 04:13:13PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219998
>
> --- Comment #8 from Todd Brandt (todd.e.brandt@intel.com) ---
> Len believes it's CONFIG_X86_KERNEL_IBT. If it's set the failure occurs, if
> it's unset the failure does not. I've yet to test it myself. I'm building
> with
> your patch, will test in 20 minutes.

Yes, that is the config option that enables IBT. Without that all that ENDBR nonsense is immaterial.
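Whether a given machine can hit this at all therefore depends on two things: the kernel being built with CONFIG_X86_KERNEL_IBT, and the CPU supporting IBT. A rough sketch of both checks follows, operating on text so it can be pointed at any .config or /proc/cpuinfo dump. The function names are hypothetical, and the assumption that the CPU flag is spelled `ibt` in /proc/cpuinfo should be verified on the kernel in use.

```python
import re


def kernel_ibt_enabled(config_text):
    """True if kernel .config text contains CONFIG_X86_KERNEL_IBT=y."""
    pattern = r"^CONFIG_X86_KERNEL_IBT=y$"
    return re.search(pattern, config_text, re.MULTILINE) is not None


def cpu_advertises_ibt(cpuinfo_text):
    """True if a 'flags' line of /proc/cpuinfo-style text lists 'ibt'.

    Assumes the feature flag is spelled 'ibt'; check your kernel.
    """
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags" and "ibt" in value.split():
            return True
    return False


def may_hit_missing_endbr(config_text, cpuinfo_text):
    """Both conditions named in the thread must hold for the #CP to fire."""
    return kernel_ibt_enabled(config_text) and cpu_advertises_ibt(cpuinfo_text)
```

A typical use would be to pass in the contents of the attached latest.config and of /proc/cpuinfo from an affected machine. For a quick experiment without rebuilding, the kernel's documented `ibt=off` boot parameter should have a similar effect to unsetting the option.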
On Wed, Apr 09, 2025 at 05:11:51PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219998
>
> --- Comment #10 from Todd Brandt (todd.e.brandt@intel.com) ---
> Annnnd... it's GOOD! Your patch seems to fix it on both the otcpl-mtl-m-1 and
> otcpl-adl-s-5. I'm now going to run an hour block of back to back S4
> hibernates
> on all 60 of our machines. I want to be sure it fixes it on the other 18 that
> failed and causes no new problems on the unaffected systems. I'll update when
> the stress test is done. Thanks!

Thank you both. I'll write up a proper patch and get it merged.
I've run on all the machines and all 20 that had hangs in S4 have been fixed. The other machines appear unaffected. Looks great! Thanks for the quick response, Peter.

Reported-and-Tested-by: Todd Brandt <todd.e.brandt@intel.com>