Created attachment 305478 [details]
Hardware and system information of the affected Thinkpad - kernel v6.6

Note:

I'm just a Linux user, I don't work in IT or even write code, so I'm probably using terms to describe the issue that are not the ones someone who knows code and what the system does under the hood would use.

Affected system:

Thinkpad, Intel Kaby Lake (i7-7600U) chipset / cpu and onboard gpu (Intel HD 620), no separate graphics card, current bios firmware; running Void Linux, xfce / lightdm

Symptom / problem:

Since the upgrade to kernel v6.5.5 (from v6.3.13) my system doesn't wake up from standby, i.e. resume from s3 fails 100% of the time.
When pressing a key or the power button nothing happens. The LED that indicates different states of the system keeps indicating standby mode.
The only way to use the system again is a hard reset by pressing the power button for a few seconds.

So, there is no crashing on resume or incomplete resume or only sometimes failing to resume or failing to go into standby in the first place.

Granted, this issue was present with kernels before v6.5, but only occasionally and it would not re-appear for many many boot cycles. So, I never had any lead as to why it would happen.

I installed kernel v6.4.16 to test for the bug - it's not in there.

For further testing I also installed kernel v6.5.2, as this was the first kernel of the 6.5 series available on void linux (and because the kernel logs mention VT-d for kernel v6.5.5 and v6.5.3, see below). Result: The bug is already in v6.5.2, too.

There's only one thing I noticed from comparing logs between kernels v6.5/6.6 vs v6.1/6.3/6.4. At the moment the system goes into standby, if running one of the latter three kernel versions, the system would print the following messages:

[elogind-daemon] Entering sleep state 'suspend'...
[kernel] PM: suspend entry (deep)

But with kernels v6.5/6.6, the kernel message is missing; only the elogind-daemon message shows up in the logs. As if the kernel didn't get the memo and thus didn't prepare and didn't listen for the wake-up call to resume.

To see if this is a bug that might be tied to a certain chipset / cpu generation, I tested kernel v6.5 on my old Thinkpad (Intel Sandy Bridge chipset / cpu, and also onboard graphics only). Its BIOS also has VT-d enabled. Interestingly, on that system, resume from standby with kernel v6.5 is no problem, even though its system is set up the same as the current Thinkpad.

So, this bug seems to be limited to a certain set of chipsets / cpus. Which seems plausible, as I couldn't find a bug report on this - not too many seem to be affected.

There's an older bug report on similar symptoms, but the cure doesn't work on my system:

"intel_iommu=on breaks resume from suspend on several Thinkpad models"
https://bugzilla.kernel.org/show_bug.cgi?id=197029

It sounds just like what my system is experiencing - apart from the fact that the term suspend is sometimes also used to describe hibernation and it is not specified which one is meant in the bug report.

So, I was hopeful on the one hand that the (workaround) fix (adding intel_iommu=off to the kernel parameters) would work on my system, too - on the other hand, this bug report was for kernel v4.13, so it's not necessarily relevant to similar symptoms with kernel v6.5 and v6.6, respectively.

Anyway, adding intel_iommu=off to the kernel parameters didn't change anything on my system. I made sure, of course, once the system was running, that intel_iommu=off was indeed used as one of the kernel parameters.

With this information in mind I did a regular internet search and found some information that in case intel_iommu=off in the kernel parameters doesn't help, disabling VT-d in BIOS might.
And in my case it does indeed help avoiding the bug - for both kernel versions, v6.5 and v6.6.

Reading some other bug reports and some changelogs, I noticed that iommu and VT-d are connected, so I posted this bug report in drivers/iommu. If it is misplaced here, please feel free to move it to the correct category.

I attached a file with the output of some commands I found being used in several other bug reports on here, just in case they might be needed / helpful.

Thank you very much for your help in advance!
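For anyone wanting to repeat the check that intel_iommu=off really is active, here is a minimal sketch. The helper name has_kernel_param is made up for illustration; /proc/cmdline and /sys/power/mem_sleep are the standard kernel interfaces for the boot parameters and the available suspend variants.

```shell
# Hypothetical helper (not from any tool): succeeds if the given kernel
# command line contains exactly the given parameter as a separate word.
has_kernel_param() {
    cmdline=$1
    param=$2
    case " $cmdline " in
        *" $param "*) return 0 ;;
    esac
    return 1
}

# On a running system the boot parameters are exposed in /proc/cmdline.
if [ -r /proc/cmdline ] && has_kernel_param "$(cat /proc/cmdline)" intel_iommu=off; then
    echo "intel_iommu=off is active"
fi

# The entry in brackets is the suspend variant in use, e.g. "s2idle [deep]".
if [ -r /sys/power/mem_sleep ]; then
    cat /sys/power/mem_sleep
fi
```

If "deep" is the bracketed entry, a suspend request should end up in s3, which matches the "PM: suspend entry (deep)" message seen with the working kernels.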
(In reply to kbugreports from comment #0)
> Since the upgrade to kernel v6.5.5 (from v6.3.13) my system doesn't wake up
> from standby, i.e. resume from s3 fails 100% of the time.

Can you please perform bisection (see Documentation/admin-guide/bug-bisect.rst for reference)?
I'll try and report back.
(In reply to kbugreports from comment #2)
> I'll try and report back.

Please do so; I guess the developers won't look into this otherwise. Make sure to build all your kernels with a config file from a working kernel (e.g. save it somewhere, copy it into your tree at every step and run "make oldconfig" before building the kernel): it's possible that this is not a regression but is caused by your distribution kernels enabling some feature that was not enabled beforehand.
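The per-step routine described above could look roughly like this. It is only a sketch: the function names and both path arguments are placeholders, not anything from this report. "make oldconfig" is the target named above; "make olddefconfig" is the variant that accepts the defaults for new options without asking.

```shell
# Hypothetical helper: copy a config saved from a known-good kernel into
# the source tree. Both arguments are placeholders for your own paths.
seed_config() {
    saved_config=$1   # e.g. a copy of the working kernel's config file
    tree=$2           # root of the kernel source tree
    cp "$saved_config" "$tree/.config"
}

# Hypothetical wrapper for one bisection step; defined here but not run.
build_step() {
    seed_config "$1" "$2" || return 1
    (
        cd "$2" || exit 1
        make oldconfig        # asks about options the saved config lacks
        make -j"$(nproc)"     # build with one job per CPU
    )
}
```

Seeding the same saved config at every step keeps the configuration from drifting between the commits being compared.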
Yeah, I'm on it. I have to look into the config stuff. I have done some runs already, found some bad and some good ones, starting out with the distribution's default config for v6.5.

But what am I doing wrong if for each test step the complete kernel gets compiled instead of just the parts that are different from the previous state? I don't clean up, so all the files from the previous compile are still there... This way, it takes over 2h for each step just to compile the next kernel...

I would understand that a complete compile has to happen if git bisect is rebasing / resetting the git tree (or whatever the correct terminology is) with every step, but others seem to be able to do a complete bisect in a few hours or even under 1 hour (https://ldpreload.com/blog/git-bisect-run). And they are basically doing the same thing (automating the process via git bisect run shouldn't change the fundamentals of the process), aren't they?

Yes, I could trim the config down by customizing it, but that wouldn't shave off too much of the time and would still lead to the kernel being compiled completely each time. Also, having a generic kernel as it is delivered by my distribution ensures that I'm not excluding e.g. some AMD stuff that might hypothetically, in a weird way, cause the issue.
I'd suggest you trim your config for the bisection, it can speed things up a lot: https://docs.kernel.org/admin-guide/quickly-build-trimmed-linux.html

That much or apparently "everything" is compiled during a bisection for mainline is normal during the first steps, as the Linux kernel has no linear history; that's why a bisection between 6.5 and 6.6 will sometimes give you versions to test that self-identify as 6.4-rc. I'm sure that's explained on the web somewhere, but I have no link at hand for that, sorry.
Thanks for the reply and help! Yeah, I was experiencing that. A v6.4-RC2 tagged one was one of the states git bisect chose for testing. But I'm checking with gitk, so I'm aware of the "convoluted mess" (for a bystander like me) that is such a complex thing as the kernel with many sub-parts, many contributors and several maintained versions... I might try the trimmed config version.
Bisect is done.

Here is the log:

git bisect start
# status: waiting for both good and bad commits
# good: [45a3e24f65e90a047bef86f927ebdc4c710edaa1] Linux 6.4-rc7
git bisect good 45a3e24f65e90a047bef86f927ebdc4c710edaa1
# status: waiting for bad commit, 1 good commit known
# bad: [d35ac6ac0e80e55bcea79af18d935f19a3e8554c] Merge tag 'iommu-updates-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect bad d35ac6ac0e80e55bcea79af18d935f19a3e8554c
# good: [44c026a73be8038f03dbdeef028b642880cf1511] Linux 6.4-rc3
git bisect good 44c026a73be8038f03dbdeef028b642880cf1511
# bad: [3a8a670eeeaa40d87bd38a587438952741980c18] Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect bad 3a8a670eeeaa40d87bd38a587438952741980c18
# bad: [6e17c6de3ddf3073741d9c91a796ee696914d8a0] Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 6e17c6de3ddf3073741d9c91a796ee696914d8a0
# bad: [f810c182366acd2eb7eb5efb3c06b1fc9f719835] Merge tag 'm68k-for-v6.5-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
git bisect bad f810c182366acd2eb7eb5efb3c06b1fc9f719835
# good: [cc423f6337d0a5ff1906f3b3d465d28c0d1705f6] Merge tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect good cc423f6337d0a5ff1906f3b3d465d28c0d1705f6
# good: [017fb83ee0612595ec70c65ddd83472706b02a50] block: Improve kernel-doc headers
git bisect good 017fb83ee0612595ec70c65ddd83472706b02a50
# bad: [9244724fbf8ab394a7210e8e93bf037abc859514] Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 9244724fbf8ab394a7210e8e93bf037abc859514
# good: [0017387938993553fe8e08bd9bcf398fb609d136] Merge tag 'irq-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 0017387938993553fe8e08bd9bcf398fb609d136
# good: [f54d4434c281f38b975d58de47adeca671beff4f] x86/apic: Provide cpu_primary_thread mask
git bisect good f54d4434c281f38b975d58de47adeca671beff4f
# good: [1703db2b90c91b2eb2d699519fc505fe431dde0e] x86/fpu: Mark init functions __init
git bisect good 1703db2b90c91b2eb2d699519fc505fe431dde0e
# bad: [5da80b28bf25c3458c7beb23794ff53622ce7eb4] x86/smp: Initialize cpu_primary_thread_mask late
git bisect bad 5da80b28bf25c3458c7beb23794ff53622ce7eb4
# good: [7e75178a0950c5ceffa2ca3225701b69752f7d3a] x86/smpboot: Support parallel startup of secondary CPUs
git bisect good 7e75178a0950c5ceffa2ca3225701b69752f7d3a
# bad: [6a4be6984595b164b6f281c5b242dbdf1c06d528] x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
git bisect bad 6a4be6984595b164b6f281c5b242dbdf1c06d528
# bad: [0c7ffa32dbd6b09a87fea4ad1de8b27145dfd9a6] x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
git bisect bad 0c7ffa32dbd6b09a87fea4ad1de8b27145dfd9a6
# first bad commit: [0c7ffa32dbd6b09a87fea4ad1de8b27145dfd9a6] x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it

And here is the bad commit:

git bisect bad
0c7ffa32dbd6b09a87fea4ad1de8b27145dfd9a6 is the first bad commit
commit 0c7ffa32dbd6b09a87fea4ad1de8b27145dfd9a6
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri May 12 23:07:56 2023 +0200

    x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it

    Implement the validation function which tells the core code whether
    parallel bringup is possible.

    The only condition for now is that the kernel does not run in an encrypted
    guest as these will trap the RDMSR via #VC, which cannot be handled at that
    point in early startup.

    There was an earlier variant for AMD-SEV which used the GHBC protocol for
    retrieving the APIC ID via CPUID, but there is no guarantee that the
    initial APIC ID in CPUID is the same as the real APIC ID. There is no
    enforcement from the secure firmware and the hypervisor can assign APIC IDs
    as it sees fit as long as the ACPI/MADT table is consistent with that
    assignment.

    Unfortunately there is no RDMSR GHCB protocol at the moment, so enabling
    AMD-SEV guests for parallel startup needs some more thought.

    Intel-TDX provides a secure RDMSR hypercall, but supporting that is outside
    the scope of this change.

    Fixup announce_cpu() as e.g. on Hyper-V CPU1 is the secondary sibling of
    CPU0, which makes the @cpu == 1 logic in announce_cpu() fall apart.

    [ mikelley: Reported the announce_cpu() fallout

    Originally-by: David Woodhouse <dwmw@amazon.co.uk>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Michael Kelley <mikelley@microsoft.com>
    Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Tested-by: Helge Deller <deller@gmx.de> # parisc
    Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
    Link: https://lore.kernel.org/r/20230512205257.467571745@linutronix.de

 arch/x86/Kconfig             |  3 +-
 arch/x86/kernel/cpu/common.c |  6 +--
 arch/x86/kernel/smpboot.c    | 87 ++++++++++++++++++++++++++++++++++++--------
 3 files changed, 75 insertions(+), 21 deletions(-)
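A log like the one above can be saved with "git bisect log" and replayed later in the same tree with "git bisect replay", which saves re-entering every good/bad verdict by hand. As a small sketch, the helper below (its name is made up for illustration) pulls the culprit hash out of a saved log, assuming the log ends with the "# first bad commit:" line that git writes once the bisection has finished.

```shell
# Hypothetical helper: print the hash from the
# "# first bad commit: [<hash>] <subject>" line of a saved bisect log.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]*\)].*/\1/p' "$1"
}

# Typical use inside the kernel tree (shown as comments, not run here):
#   git bisect log > bisect.log      # save the verdicts given so far
#   git bisect replay bisect.log     # redo them later in one go
#   git show "$(first_bad_commit bisect.log)"
```

The function prints nothing if the bisection had not finished when the log was saved.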
(In reply to kbugreports from comment #7)
> Bisect is done.

There is a proposed fix at [1]. Please test.

[1]: https://lore.kernel.org/all/20231026170330.4657-1-mario.limonciello@amd.com/
> There is a proposed fix at [1]. Please test.
>
> [1]:
> https://lore.kernel.org/all/20231026170330.4657-1-mario.limonciello@amd.com/

I can test it, but are you sure this is worth considering? It reads to me as if Thomas Gleixner is happy with the proposed fix.

Also, at one point the one who wrote the code writes

"I've reviewed it with the internal team and confirmed there was a BIOS bug where the MSR wasn't restored after the S3 cycle completed. The BIOS team has fixed it."

Which sounds to me as if the problem he was working on was caused by something else and that has been fixed by another group. Not sure though if this means that this "fix" (from at least the end of October) should be in the releases by now. If so, it would mean the issue that I'm (still) seeing must be something else.

But if the fix isn't considered "good", parts of it are even superfluous and it won't be integrated anyway, what's the point, really?

The only "helpful" outcome would probably be if it doesn't fix my problem. Then we would know that these are separate issues.
(In reply to kbugreports from comment #9)
> It reads to me as if Thomas Gleixner is happy with the proposed fix.

Yeah, I "fixed" that sentence the wrong way: it was only partly rewritten, which inverted its meaning. It should read:

It reads to me as if Thomas Gleixner _isn't_ happy with the proposed fix.
Ok, so the fix seems to work. I used it on v6.6.4. As that version isn't yet available on my distribution I haven't tested the unpatched version. But I read the changelog earlier and even though it addresses some resume issues I don't think this one was addressed. Once 6.6.4 is available for my distro I'll install it just to check. I just don't want to compile yet another kernel right now.

So, thanks for the help. We'll see if this fix gets merged or not (see my previous thoughts / interpretation of the discussion about the fix in the linked thread).
For future reference and more context, I add below the links to the initial internal bug report [1] and first draft of the proposed fix [2]. The internal bug / regression report concerns AMD systems. Assuming it's the same issue, it's really AMD and Intel, when considering my bug report. [1]: https://lore.kernel.org/all/3d96c70e-da3b-49c2-a776-930a9f1b815d@amd.com/ [2]: https://lore.kernel.org/all/20231023160018.164054-1-mario.limonciello@amd.com/
Remind me please: does this still happen with latest 6.7-rc as well (e.g. rc3 or later)?
kbugreports, it would really help a lot if you could check if 6.7-rc is affected; I fear developers might ignore this if you don't try that.
This will probably not happen any time soon, as zfs isn't ready for kernel 6.7, yet. Apparently, they found some changes in the kernel they have to adapt to. Sorry.
(In reply to kbugreports from comment #15)
> as zfs isn't ready for kernel 6.7

What? All your testing was with an out-of-tree kernel module? Then no developer might look into this. I'll bring it up anyway.
Update on the issue:

I just saw this internal bug report [1] on an issue that found the same commit to be the culprit. Thomas Gleixner proposed a patch, which I assume, coming from him, to be kind of an "official patch, intended for release", so I tested it.

To be sure it's not the trimmed-down (localmod) build-config I'm using for faster builds, I built v6.6.7 with and without the patch applied.

Result: The proposed patch resolves the issue I was experiencing since the commit mentioned above was introduced with v6.5.

This is the patch I used:

--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -268,6 +268,22 @@ SYM_INNER_LABEL(secondary_startup_64_no_
     testl   $X2APIC_ENABLE, %eax
     jnz     .Lread_apicid_msr
 
+#ifdef CONFIG_X86_X2APIC
+    /*
+     * If system is in X2APIC mode then MMIO base migt not be
+     * mapped causing the MMIO read below to fault. Faults can't
+     * be handled at that point.
+     */
+    cmpl    $0, x2apic_mode(%rip)
+    jz      .Lread_apicid_mmio
+
+    /* Force the AP into X2APIC mode. */
+    orl     $X2APIC_ENABLE, %eax
+    wrmsr
+    jmp     .Lread_apicid_msr
+#endif
+
+.Lread_apicid_mmio:
     /* Read the APIC ID from the fix-mapped MMIO space. */
     movq    apic_mmio_base(%rip), %rcx
     addq    $APIC_ID, %rcx

[1]: https://lore.kernel.org/all/87cyv7on8t.ffs@tglx/t/#m004271eb2b5563f477b4fea99ea66fecb5477275
Here's [1, 2] some more info on the fix / the background of the issue it addresses.

Apparently the "bad commit" (see above) leads to different issues on different systems. On the system mentioned in the links it led to not even booting, so more serious than "just" not waking up from standby as on my system.

I assume the patch will be included in one of the next versions of v6.6.x and in v6.7.

Here's the patch in full, typo fixed:

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 086a2c3aaaa04..0f8103240fda3 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -255,6 +255,22 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
     testl   $X2APIC_ENABLE, %eax
     jnz     .Lread_apicid_msr
 
+#ifdef CONFIG_X86_X2APIC
+    /*
+     * If system is in X2APIC mode then MMIO base might not be
+     * mapped causing the MMIO read below to fault. Faults can't
+     * be handled at that point.
+     */
+    cmpl    $0, x2apic_mode(%rip)
+    jz      .Lread_apicid_mmio
+
+    /* Force the AP into X2APIC mode. */
+    orl     $X2APIC_ENABLE, %eax
+    wrmsr
+    jmp     .Lread_apicid_msr
+#endif
+
+.Lread_apicid_mmio:
     /* Read the APIC ID from the fix-mapped MMIO space. */
     movq    apic_mmio_base(%rip), %rcx
     addq    $APIC_ID, %rcx

[1: Message on email server]: https://lore.kernel.org/all/170273153180.398.6629279525112148301.tip-bot2@tip-bot2/t/#u
[2: Same message on git]: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=69a7386c1ec25476a0c78ffeb59de08a2a08f495
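Since the added code sits inside "#ifdef CONFIG_X86_X2APIC", it is only compiled in when that option is set, which matters for anyone testing the patch with a trimmed config. A minimal sketch for checking a config file follows; the helper name config_has is made up, and the ".config" path assumes you are in the root of a configured kernel tree.

```shell
# Hypothetical helper: succeeds if the option is built in (=y) in the
# given kernel config file.
# $1: path to a kernel config, $2: option name, e.g. CONFIG_X86_X2APIC
config_has() {
    grep -q "^$2=y\$" "$1"
}

# Assumes the current directory is a configured kernel source tree.
if [ -r .config ] && config_has .config CONFIG_X86_X2APIC; then
    echo "CONFIG_X86_X2APIC=y: the patched code path will be built"
fi
```

The same check works on a copy of the running kernel's config wherever your distribution provides one.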
The fix has been merged by Linus Torvalds into the kernel on his git for v6.7-RC7 [1]:

Merge tag 'x86-urgent-2023-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

The title of the specific fix is:

x86/smpboot/64: Handle X2APIC BIOS inconsistency gracefully

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f82f1c3a03694800a4104ca6b6d3282bd4e213d
Now that the fix has been merged into kernel v6.6.9 and the issue no longer occurs, the issue is considered fixed.

For future reference here is a short summary of the issue:

Problem:

On the respective system (Thinkpad, Kaby Lake cpu, latest BIOS from September of 2023) the issue would show as the system not being able to wake up from standby (s3) by any means with kernel versions from v6.5 to 6.6.8.

Cause:

The issue has been tracked down via bisecting to the commit

0c7ffa32dbd6b09a87fea4ad1de8b27145dfd9a6 - x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0c7ffa32dbd6b09a87fea4ad1de8b27145dfd9a6

Fix:

The issue has been fixed by the following patch:

Commit 69a7386c1ec25476a0c78ffeb59de08a2a08f495 - x86/smpboot/64: Handle X2APIC BIOS inconsistency gracefully
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=69a7386c1ec25476a0c78ffeb59de08a2a08f495

According to the notes in the patch, the "bad" commit didn't really introduce a bug but essentially laid bare a BIOS inconsistency in some BIOSes / on some systems. This also explains why most systems were apparently unaffected and why affected systems showed different symptoms. Apparently, some systems would even fail to boot in the first place.

Contrary to my initial suspicions the issue had nothing to do with iommu / vt-d, even though disabling vt-d in the BIOS made the symptom of not being able to resume from standby go away.

Thanks to Thomas Gleixner and his team for fixing this issue!
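For readers who want to know whether their kernel falls into the range named in this summary (v6.5 up to and including 6.6.8), here is a rough sketch. The function name is made up and the range comes solely from this report; other distributions may carry the fix as a backport, and 6.7 release candidates before rc7 are not covered by the check.

```shell
# Hypothetical helper: succeeds if the version string lies in the range
# reported here as affected (6.5.x, or 6.6.0 through 6.6.8).
is_affected() {
    # Drop any suffix after the numeric part, such as "_1" or "-rc7".
    ver=${1%%[!0-9.]*}
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    case $rest in
        *.*) patch=${rest#*.}; patch=${patch%%.*} ;;
        *)   patch=0 ;;
    esac
    [ "$major" -eq 6 ] || return 1
    [ "$minor" -eq 5 ] && return 0
    [ "$minor" -eq 6 ] && [ "$patch" -le 8 ] && return 0
    return 1
}

if is_affected "$(uname -r)"; then
    echo "running kernel is in the range reported as affected"
fi
```

Whether a given machine actually shows the symptom still depends on its BIOS, as described above.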
Report after ~6 weeks: Since the issue was fixed, even the occasional occurrences of this issue in the past (with kernel versions <= 6.5) have not re-occurred thus far. There hasn't been a single instance in which the system did not wake up from standby since the fix in v6.6.9. A very happy and thankful Linux user.