Bug 217857 - Uhhuh. NMI received for unknown reason 3d/2d/ on CPU xx
Status: RESOLVED INVALID
Alias: None
Product: Power Management
Classification: Unclassified
Component: cpuidle
Hardware: AMD Linux
Importance: P3 normal
Assignee: Len Brown
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-01 15:38 UTC by Janpieter Sollie
Modified: 2023-11-29 09:16 UTC

See Also:
Kernel Version: 6.5.0
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg (186.65 KB, text/plain)
2023-09-01 15:38 UTC, Janpieter Sollie
kernel 6.5.0 config (130.80 KB, text/plain)
2023-09-01 15:38 UTC, Janpieter Sollie

Description Janpieter Sollie 2023-09-01 15:38:25 UTC
Created attachment 305016
dmesg

This seems to be a regression since the 6.5 release.
The infamous kernel error message shows up on this 32c/64t Threadripper:
> [ 2046.269103] perf: interrupt took too long (3141 > 3138), lowering
> kernel.perf_event_max_sample_rate to 63600
> [ 2405.049567] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> [ 2405.049571] Dazed and confused, but trying to continue
> [ 2406.902609] Uhhuh. NMI received for unknown reason 2d on CPU 33.
> [ 2406.902612] Dazed and confused, but trying to continue
> [ 2423.978918] Uhhuh. NMI received for unknown reason 2d on CPU 33.
> [ 2423.978921] Dazed and confused, but trying to continue
> [ 2429.995160] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2429.995163] Dazed and confused, but trying to continue
> [ 2431.233575] Uhhuh. NMI received for unknown reason 3d on CPU 36.
> [ 2431.233578] Dazed and confused, but trying to continue
> [ 2442.382252] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2442.382255] Dazed and confused, but trying to continue
> [ 2442.725076] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2442.725078] Dazed and confused, but trying to continue
> [ 2442.732025] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> [ 2442.732027] Dazed and confused, but trying to continue
> [ 2443.666671] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> [ 2443.666673] Dazed and confused, but trying to continue
> [ 2443.756776] Uhhuh. NMI received for unknown reason 3d on CPU 39.
> [ 2443.756779] Dazed and confused, but trying to continue
> [ 2443.907309] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2443.907311] Dazed and confused, but trying to continue
> [ 2444.004281] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> [ 2444.004283] Dazed and confused, but trying to continue
> [ 2444.207944] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2444.207945] Dazed and confused, but trying to continue
> [ 2444.517408] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> [ 2444.517410] Dazed and confused, but trying to continue
> [ 2444.946941] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2444.946943] Dazed and confused, but trying to continue
> [ 2445.573807] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2445.573809] Dazed and confused, but trying to continue
> [ 2445.776108] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2445.776110] Dazed and confused, but trying to continue
> [ 2445.969029] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2445.969031] Dazed and confused, but trying to continue
> [ 2446.977458] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> [ 2446.977460] Dazed and confused, but trying to continue
> [ 2447.044329] Uhhuh. NMI received for unknown reason 2d on CPU 46.
> [ 2447.044331] Dazed and confused, but trying to continue
> [ 2447.469269] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2447.469271] Dazed and confused, but trying to continue
> [ 2447.866530] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2447.866531] Dazed and confused, but trying to continue
> [ 2448.456615] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2448.456617] Dazed and confused, but trying to continue
> [ 2448.509614] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2448.509616] Dazed and confused, but trying to continue
> [ 2448.758005] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> [ 2448.758007] Dazed and confused, but trying to continue
> [ 2449.093565] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2449.093567] Dazed and confused, but trying to continue
> [ 2449.227344] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2449.227346] Dazed and confused, but trying to continue
> [ 2449.770534] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2449.770535] Dazed and confused, but trying to continue
> [ 2449.955594] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2449.955596] Dazed and confused, but trying to continue
> [ 2450.077872] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> [ 2450.077874] Dazed and confused, but trying to continue
> [ 2450.190844] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> [ 2450.190846] Dazed and confused, but trying to continue
> [ 2450.561450] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2450.561452] Dazed and confused, but trying to continue
> [ 2450.604498] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2450.604500] Dazed and confused, but trying to continue
> [ 2450.814451] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2450.814453] Dazed and confused, but trying to continue
> [ 2450.923171] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> [ 2450.923173] Dazed and confused, but trying to continue
> [ 2451.084612] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> [ 2451.084614] Dazed and confused, but trying to continue
> [ 2451.793342] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> [ 2451.793343] Dazed and confused, but trying to continue
> [ 2451.793662] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> [ 2451.793664] Dazed and confused, but trying to continue
> [ 2451.926819] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> [ 2451.926821] Dazed and confused, but trying to continue
> [ 2452.502583] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> [ 2452.502585] Dazed and confused, but trying to continue
> [ 2452.675633] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> [ 2452.675636] Dazed and confused, but trying to continue
> [ 2452.974655] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> [ 2452.974657] Dazed and confused, but trying to continue
> [ 7065.904855] elogind-daemon[2461]: New session c2 of user janpieter.
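
To see at a glance which CPUs and reason codes dominate a log like the one above, the lines can be tallied with a short script. This is only a sketch: the function name `tally_unknown_nmis` and the `SAMPLE` text are made up for illustration, and the regular expression assumes the message format shown in this report.

```python
import re
from collections import Counter

# Matches the kernel message as it appears in this report; a leading
# "> " quote marker or timestamp in front of it does not matter.
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-f]{2}) on CPU (\d+)")


def tally_unknown_nmis(dmesg_text):
    """Return (per-CPU counts, per-reason counts) for unknown-NMI lines."""
    by_cpu = Counter()
    by_reason = Counter()
    for match in NMI_RE.finditer(dmesg_text):
        reason, cpu = match.group(1), int(match.group(2))
        by_cpu[cpu] += 1
        by_reason[reason] += 1
    return by_cpu, by_reason


# Hypothetical sample: a few lines copied from the log in this report.
SAMPLE = """\
[ 2405.049567] Uhhuh. NMI received for unknown reason 2d on CPU 48.
[ 2405.049571] Dazed and confused, but trying to continue
[ 2406.902609] Uhhuh. NMI received for unknown reason 2d on CPU 33.
[ 2429.995160] Uhhuh. NMI received for unknown reason 3d on CPU 48.
"""

if __name__ == "__main__":
    by_cpu, by_reason = tally_unknown_nmis(SAMPLE)
    for cpu, count in by_cpu.most_common():
        print(f"CPU {cpu}: {count}")
    for reason, count in by_reason.most_common():
        print(f"reason {reason}: {count}")
```

Feeding it the full attached dmesg instead of `SAMPLE` would show whether the NMIs really cluster on a few CPUs (48 and 49 appear most often in the excerpt).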

According to dmesg, this happens without any apparent trigger (I didn't even notice it at the time).
Some googling points at an ACPI C-state problem on AMD CPUs a few years ago.
In 5.14 kernels, I didn't see it.
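
For what it's worth, the two reason values in the log (2d and 3d) can be compared bit by bit. The sketch below assumes, as I understand the x86 NMI code, that the printed value is the byte read from the NMI status port and that the masks are the `NMI_REASON_SERR` / `NMI_REASON_IOCHK` values from `arch/x86/include/asm/mach_traps.h`; the helper names `classify` and `differing_bits` are hypothetical.

```python
# Bit masks as I recall them from arch/x86/include/asm/mach_traps.h
# (treat them as an assumption and check against your kernel tree).
NMI_REASON_SERR = 0x80
NMI_REASON_IOCHK = 0x40


def classify(reason):
    """Name the known NMI sources flagged in a reason byte."""
    names = []
    if reason & NMI_REASON_SERR:
        names.append("SERR")
    if reason & NMI_REASON_IOCHK:
        names.append("IOCHK")
    # Neither bit set is what makes the kernel report "unknown reason".
    return names or ["unknown"]


def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    return [bit for bit in range(8) if ((a ^ b) >> bit) & 1]


if __name__ == "__main__":
    for value in (0x2D, 0x3D):
        print(f"{value:02x}: {value:08b} -> {classify(value)}")
    print("bits that differ:", differing_bits(0x2D, 0x3D))
```

Both values have the SERR and IOCHK bits clear, and they differ only in bit 4, so the 2d/3d distinction probably carries no extra information about the source of the NMI.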
Comment 1 Janpieter Sollie 2023-09-01 15:38:55 UTC
Created attachment 305017
kernel 6.5.0 config
Comment 2 Janpieter Sollie 2023-09-01 15:40:22 UTC
> some googling points at a ACPI C state problem on AMD CPUs a few years ago
> in 5.14 kernels, I didn't see it.

I meant 6.4 kernels, sorry for the typo.
Comment 3 Bagas Sanjaya 2023-09-02 00:08:34 UTC
(In reply to Janpieter Sollie from comment #0)

Can you please do bisection (see Documentation/admin-guide/bug-bisect.rst
in the kernel sources for instructions)?
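
Since the message shows up rarely, each bisection step needs a rule for when to call a kernel good. A minimal sketch of such a rule follows, assuming a build counts as "good" only after a full day without the message; the function name `bisect_verdict`, the 86400-second default, and the use of the last dmesg timestamp as a lower bound on uptime are all assumptions for illustration, not part of the kernel's bisection guide.

```python
import re

NMI_RE = re.compile(r"NMI received for unknown reason [0-9a-f]{2} on CPU \d+")
# dmesg timestamps look like "[ 2405.049567]" at the start of a line.
STAMP_RE = re.compile(r"^\[\s*(\d+)\.\d+\]", re.MULTILINE)


def bisect_verdict(dmesg_text, min_uptime_s=86400):
    """Return 'bad', 'good', or 'inconclusive' for the running kernel.

    min_uptime_s is an assumed soak time (one day by default).
    """
    if NMI_RE.search(dmesg_text):
        return "bad"
    stamps = [int(s) for s in STAMP_RE.findall(dmesg_text)]
    # The newest log timestamp is only a lower bound on the uptime,
    # so this errs on the side of waiting longer.
    if stamps and max(stamps) >= min_uptime_s:
        return "good"
    return "inconclusive"


if __name__ == "__main__":
    sample = "[ 2405.049567] Uhhuh. NMI received for unknown reason 2d on CPU 48.\n"
    print(bisect_verdict(sample))
```

The result would then be passed by hand to `git bisect good` or `git bisect bad` after booting each candidate kernel; an "inconclusive" result just means the machine has to run longer.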
Comment 4 Janpieter Sollie 2023-09-02 03:45:10 UTC
Is there any way I can increase the probability of this happening?
I installed the 6.5 kernel 24h ago; so far the issue has happened only once (the system was up all day). I also saw no reason why it would happen.
Bisecting this may be painful.
Comment 5 Bagas Sanjaya 2023-09-08 08:02:41 UTC
(In reply to Janpieter Sollie from comment #4)
> Is there any way I can increase the probability of this to happen?
> I installed the 6.5 kernel 24h ago, so far the issue happened only once
> (system was up all day). I also saw no reason why it would happen.
> Bisecting this may be painful.

Sorry, I can't help you with that. In the meantime, what were you doing
on your system that led to this bug report?
Comment 6 Janpieter Sollie 2023-09-08 09:24:05 UTC
(In reply to Bagas Sanjaya from comment #5)
> (In reply to Janpieter Sollie from comment #4)
> > Is there any way I can increase the probability of this to happen?
> > I installed the 6.5 kernel 24h ago, so far the issue happened only once
> > (system was up all day). I also saw no reason why it would happen.
> > Bisecting this may be painful.
> 
> Sorry, I can't help you on that. In the mean time, what are you doing
> on your system that lead to this bug report?

Honestly, I do not know.
The system is mostly used as a "jump in in case the desktop / laptop isn't powerful enough" device, and to test new features: it provides long-term storage, takes over DHCP/DNS functionality from the Raspberry Pi in case that one needs maintenance, provides resources for parking VMs, distcc, and so on. If it is not used for a longer time (especially during the night), it is powered off and/or put in S3.
At the time of the NMI, it was idle and not doing anything I knew of.
What I do know is that when it wakes from S3 (I hope it's related), an interrupt is triggered that nobody cares about. Linux advises me to boot with irqpoll, but I first wanted to know the reason for the "nobody cared" interrupt.
Could this be related?
Comment 7 Bagas Sanjaya 2023-10-06 01:17:22 UTC
(In reply to Janpieter Sollie from comment #6)

Seems like this is a pure hardware issue (see [1]).

[1]: https://lore.kernel.org/all/e08e33d5-4f6d-91aa-f335-9404d16a983c@amd.com/
Comment 8 Janpieter Sollie 2023-10-06 06:36:27 UTC
thank you for the link.
Does this mean if I don't compile with IBS, the error won't show?
Comment 9 Bagas Sanjaya 2023-10-06 12:18:46 UTC
On 06/10/2023 13:36, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217857
> 
> --- Comment #8 from Janpieter Sollie (janpieter.sollie@edpnet.be) ---
> thank you for the link.
> Does this mean if I don't compile with IBS, the error won't show?
> 

This is a known hardware issue, see [1]. Maybe you have to upgrade your
CPU to an unaffected one (like Zen 3).

Bye!

[1]: https://lore.kernel.org/all/20210317084829.GA474581@gmail.com/
