Bug 215337 - NUC8i7BEH randomly freeze when idle (except outdated kernel 5.9.16)
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: Intel Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
Reported: 2021-12-16 00:55 UTC by Dmitry
Modified: 2022-11-27 21:30 UTC

Kernel Version: 5.15.3
Regression: No

Description Dmitry 2021-12-16 00:55:09 UTC
My Intel NUC8i7BEH randomly freezes when idle on 5.15.3 and 5.13.0, but works fine on 5.9.16.

The freezes only happen when the PC is left unattended. It can crash after varying amounts of idle time, from just 30 minutes up to a full day or even a week of uptime.

I have tried intel_idle.max_cstate=1 with no improvement.

Another user of similar hardware (NUC8i5BEH) reports that all kernel versions above 5.9 have this issue: https://www.reddit.com/r/intelnuc/comments/npua35/nuc8i5beh_running_linux_randomly_freezes_when/
Comment 1 Borislav Petkov 2021-12-31 17:50:56 UTC
Can you connect it to another machine over serial and catch dmesg when it crashes?

It is kinda hard to debug something without knowing where to start...

Thx.
Comment 2 Bruno Gravato 2022-01-13 01:39:24 UTC
Hi, I made the Reddit post that Dmitry linked.

I've been struggling with this for about a year.

I've tried many versions of kernel 5.10, 5.11, 5.12, 5.13, 5.14 and 5.15. The result is always the same: eventually freezes during idle time (it can take 30 min of idle time, a few hours or a few days...).

The one kernel that never crashed was 5.9.15 (at the time installed from Debian buster-backports). I've run it for nearly one year, sometimes with uptimes of a full month without any crash.

When it happens, there's nothing in the logs (just normal logs before it froze). The PC is still powered on, but there is no activity (no image on the screen, the disk activity light never blinks, and it doesn't reply to ICMP pings over the network).

Recently I started playing with some BIOS settings. Last one I changed was one that said "legacy standby mode" and I changed it to "modern standby mode".

I'm currently trying kernel 5.10.84 (latest on Debian stable), so far with an uptime of 11 days, which is almost a record. Not sure if I'm just on a hot lucky streak or if this "modern standby mode" in the BIOS really has anything to do with it...

I've seen many people all over the internet complaining about similar issues, some even on AMD CPUs. Apparently same symptoms...

I don't have a serial port on this PC, nor do I have any USB-to-Serial converter... Any way of emulating it over ethernet perhaps? Or USB to USB?
Comment 3 Borislav Petkov 2022-01-13 10:23:16 UTC
> sometimes with uptimes of a full month without any crash.

What does that mean exactly, 5.9.15 would sometimes crash too?

> Any way of emulating it over ethernet perhaps? Or USB to USB?

You could try netconsole:

https://www.kernel.org/doc/html/latest/networking/netconsole.html

and see whether that catches anything...
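
For reference, a minimal sketch of how the netconsole parameter string could be assembled, following the `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]` format from the documentation linked above. The helper name, the interface name and the IP addresses are hypothetical placeholders, not values from this report; the default ports 6665 and 6666 are the ones given in the netconsole documentation.

```python
def netconsole_param(dev, src_ip, tgt_ip, tgt_mac="", src_port=6665, tgt_port=6666):
    """Build a netconsole= boot/module parameter string.

    Format (per the kernel netconsole documentation):
    netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
    An empty tgt_mac leaves the MAC field blank, as the format allows.
    """
    return f"netconsole={src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


# Hypothetical interface name and addresses; substitute your own.
print(netconsole_param("eno1", "192.168.1.20", "192.168.1.30"))
```

The resulting string would go on the kernel command line or be passed to `modprobe netconsole`; the receiving machine then needs something listening for UDP on the target port.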
Comment 4 Bruno Gravato 2022-01-13 21:57:01 UTC
(In reply to Borislav Petkov from comment #3)
> > sometimes with uptimes of a full month without any crash.
> 
> What does that mean exactly, 5.9.15 would sometimes crash too?

No, 5.9.15 never ever crashed. Rock-solid!
I just need to reboot sometimes for other reasons (usually just to try a new kernel to see if my issue got fixed in newer kernels), so the longest uptime was perhaps a full month, but not because it crashed. Just because I intentionally rebooted or shut down.

When I bought this NUC (late 2020), I first used kernel 4.19 (the default in Debian buster at the time), then I think I upgraded to 5.8 (from buster-backports) and later 5.9 and 5.10...
This issue started happening with 5.10. So I kept going back to the last known working version which was 5.9.15.
Versions prior to that I think didn't crash either, but I didn't use them for long enough to be sure.

If it wasn't for this 5.9 kernel, I would say it was probably some hardware issue... But it's really odd that in about 12 months using this kernel it never crashed... Any attempt at using newer kernels it always crashes, whether it's after 2h idle, or just after a few days...

Most of the time it crashes after a few hours or just a couple of days. On a few occasions I got uptimes of 5-6 days before crashing. Once (and only once) it nearly reached 2 weeks. That was with a 5.13 kernel, and at first I thought the issue had been fixed, but then it eventually crashed... I rebooted to the same kernel, but then it crashed again after just a few hours... so I guessed that that time was just pure luck...

It never crashes while I'm actively using it (this is my desktop pc, that I use daily). It only crashes when I'm AFK. Shortest idle period time before crash I remember was about 30 minutes, but it usually takes at least 2h idle before crashing (I calculate that by looking at the timestamps for the last login lock entry and the last entry in the logs from previous boot).
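
The idle-time calculation described above can be sketched as a simple timestamp difference. This is only an illustration: the function name, the timestamp format and the example values are assumptions, not taken from the actual logs.

```python
from datetime import datetime, timedelta

# Assumed timestamp format; adjust to whatever your log output uses.
FMT = "%Y-%m-%d %H:%M:%S"


def idle_before_freeze(last_lock, last_log_entry):
    """Time between the last session-lock entry and the last log line
    written before the freeze (both taken from the previous boot's logs)."""
    return datetime.strptime(last_log_entry, FMT) - datetime.strptime(last_lock, FMT)


# Hypothetical example values: locked at 23:10, last log line at 01:40.
print(idle_before_freeze("2022-01-20 23:10:00", "2022-01-21 01:40:00"))
```

Note that the last log line only gives a lower bound on the idle time, since the machine may have stayed up for a while without logging anything before it froze.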

I've tried limiting C-states with intel.max_cstate parameter on boot. It still crashed except with max_cstate=1, but I only ran that for 2-3 days. Idle power consumption almost tripled, so it wasn't a viable workaround for me.

The most recent twist in this story was when I put it in suspend mode by mistake (I never use suspend) some days ago. Apparently it went to sleep correctly: power light turned from blue to amber (blinking, as it was configured in the BIOS settings). I then used the keyboard/mouse and it started waking up (power light changed from amber to blue as expected and I think the fan started spinning too), but the OS didn't wake (no image, no disk activity, no replies to pings over the network, same symptoms as with my usual idle crashes). I hard-rebooted and tried suspend again with the same results.

I found this odd, because when I bought this NUC (late 2020) I tried suspend mode at the time and it worked fine. Not sure what kernel version I was using then.
Since then I updated the BIOS firmware a couple of times in the hope it would fix these random crashes (it didn't). Since then I probably didn't put it to suspend again until this time. Not sure if it was the BIOS updates that caused the change in behaviour...

Anyway, I went to the BIOS config and I saw there was a "standby mode" option that I had never noticed before (doesn't mean it wasn't there...). It was set to "Legacy standby mode". I switched it to "modern standby mode" and tested suspend again and it woke up successfully...

Since the crash symptoms were similar, I thought... let me give it a try with a newer kernel... so I rebooted to kernel 5.10.84 (currently the latest one in Debian stable) and it hasn't crashed yet for 12 days... Maybe just coincidence... maybe not...

I may wait a couple more days and then I'll try reverting that BIOS setting to legacy standby mode and try this kernel (5.10.84) again and see if it crashes (I'll try running netconsole to catch some logs, thanks for the tip).

This is very hard to debug and can take a very long time because crash times can range from 2h to 5 days... Add to that the fact that this is my daily driver, which makes it even more inconvenient to run these kinds of tests... but I'll give it a try.

Sorry for the long post... just trying to give as much info as possible...
Comment 5 Borislav Petkov 2022-01-15 12:35:38 UTC
(In reply to Bruno Gravato from comment #4)
> When I bought this NUC (late 2020), I first used kernel 4.19 (the default in
> Debian buster at the time), then I think I upgraded to 5.8 (from
> buster-backports) and later 5.9 and 5.10...
> This issue started happening with 5.10. So I kept going back to the last
> known working version which was 5.9.15.
> Versions prior to that I think didn't crash either, but I didn't use them
> for long enough to be sure.

Ok, what about 5.9.16?

This is the last stable release before 5.9 has been closed.

I'm asking because you probably could bisect.

I say probably because in reality, because it takes so long to trigger
the issue, you might end up bisecting the kernel a looooong time.
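
To put a rough number on "a looooong time": git bisect needs about log2(N) test builds for N candidate commits. The sketch below estimates the worst case, with two loudly assumed inputs that are not from this report: a hypothetical round figure of 16000 commits between the good and bad kernels, and 5 freeze-free days before a build is declared good.

```python
import math


def bisect_steps(n_commits):
    """Approximate number of builds git bisect needs for n_commits candidates."""
    return math.ceil(math.log2(n_commits)) if n_commits > 1 else 0


def worst_case_days(n_commits, days_per_step):
    """Upper bound on wall-clock days if every step must run days_per_step."""
    return bisect_steps(n_commits) * days_per_step


# Both inputs are assumptions for illustration only.
print(bisect_steps(16000))        # number of builds to test
print(worst_case_days(16000, 5))  # days if each build is watched for 5 days
```

Bad builds usually reveal themselves sooner than the full observation window, so the real duration would likely be shorter; on the other hand, a freeze that takes longer than the window to appear would mislabel a bad commit as good and derail the bisection.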

> If it wasn't for this 5.9 kernel, I would say it was probably some hardware
> issue... But it's really odd that in about 12 months using this kernel it
> never crashed... Any attempt at using newer kernels it always crashes,
> whether it's after 2h idle, or just after a few days...

Sounds familiar. Nasty bugs, likely a BIOS issue.

> It never crashes while I'm actively using it (this is my desktop pc, that I
> use daily). It only crashes when I'm AFK. Shortest idle period time before
> crash I remember was about 30 minutes, but it usually takes at least 2h idle
> before crashing (I calculate that by looking at the timestamps for the last
> login lock entry and the last entry in the logs from previous boot).
> 
> I've tried limiting C-states with intel.max_cstate parameter on boot. It

You mean intel_idle.max_cstate?

> still crashed except with max_cstate=1, but I only run it for 2-3 days.
> Power consumption idle almost tripled, so it wasn't a viable workaround for
> me.

Sounds like it goes into some deeper idle state and it can't come out of it.

> Anyway, I went to the BIOS config and I saw there was a "standby mode"
> option that I had never noticed before (doesn't mean it wasn't there...). It
> was set to "Legacy standby mode". I switched it to "modern standby mode" and
> tested suspend again and it woke up successfully...
> 
> Since the crash symptoms were similar, I thought... let me give it a try
> with a newer kernel... so I rebooted to kernel 5.10.84 (currently the latest
> one in debian stable) and it hasn't crashed yet for 12 days... May be just
> coincidence... maybe not...

You might be onto something here. It'll be interesting to see whether
that "modern" standby mode works...

> I may wait a couple more days and then I'll try reverting that BIOS setting
> to legacy standby mode and try this kernel (5.10.84) again and see if it
> crashes (I'll try running netconsole to catch some logs, thanks for the tip).

Yap. By the sound of it, I'm sceptical netconsole'll catch anything from idle...

> This is very hard to debug and can take a very long time because crash times
> can range from 2h to 5 days... Adding to the fact that this is my daily
> driver, it makes it even more inconvenient to run this kind of tests... but
> I'll give it a try.

Yah, stuff like that is nasty to debug.
 
> Sorry for the long post... just trying to give in as much info as possible...

Nah, don't be. If only all the bug reporters would take the time and
write up the symptoms with such detail, a couple more bugs would be
solved. :-)

So thanks!
Comment 6 Bruno Gravato 2022-01-15 13:52:21 UTC
(In reply to Borislav Petkov from comment #5)
> Ok, what about 5.9.16?
> 
> This is the last stable release before 5.9 has been closed.
> 
> I'm asking because you probably could bisect.
> 
> I say probably because in reality, because it takes so long to trigger
> the issue, you might end up bisecting the kernel a looooong time.

I didn't try 5.9.16. I think the last 5.9 version that got into debian's buster-backports was 5.9.15, from there I think they jumped to 5.10.xx (not exactly sure which minor version). That's when the problems started...

I meant to try each version from 5.9.15 up, going through 5.10.0 and so on... but I didn't have the chance yet.

I once compiled a 5.10.xx version (again I don't recall which minor version, sorry), using the kernel config from debian 5.9.15 and leaving the new features at the default setting. It also crashed.

> > I've tried limiting C-states with intel.max_cstate parameter on boot. It
> 
> You mean intel_idle.max_cstate?

Yes, sorry for the typo, I meant intel_idle.max_cstate
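
Since the exact spelling matters here, a small sketch for checking which value (if any) a running system was actually booted with. It parses a command-line string such as the contents of `/proc/cmdline`; the helper names and the sample command line are hypothetical.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent.

    A parameter given without '=' yields an empty string. If the parameter
    appears more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical sample command line, not taken from either reporter's machine.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/nvme0n1p2 ro quiet intel_idle.max_cstate=1"
print(kernel_param(sample, "intel_idle.max_cstate"))  # correctly spelled name
print(kernel_param(sample, "intel.max_cstate"))       # misspelled name is absent
```

On a live system, `kernel_param(read_cmdline(), "intel_idle.max_cstate")` would show whether the intended limit was really passed at boot.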

> > still crashed except with max_cstate=1, but I only run it for 2-3 days.
> > Power consumption idle almost tripled, so it wasn't a viable workaround for
> > me.
> 
> Sounds like it goes into some deeper idle state and it can't come out of it.

My thought too... Though Dmitry (the OP of this bug report) says that in his tests it didn't help...
Like I said, I didn't try that for long enough to be sure...
One of the reasons I use a mini-pc as my desktop PC is to keep the electricity bill low... That option almost tripled power consumption during idle periods, so I didn't really care to thoroughly test it.

Dmitry said (on the reddit thread) that he now tried disabling monitor suspend mode with:
  xset -dpms
  xset s off 

And that seems to be holding up too so far for him... (I haven't tried that yet)

Could it be some bug in the i915 GPU driver? When I was still on Debian buster, Xorg crashed a couple of times while doing some GPU-intensive things. Only X crashed, I could restart it without rebooting, but there were some i915 errors in the syslog when that happened.

Since I upgraded to Debian Bullseye (and as a consequence upgraded the mesa libs as well) that didn't happen again. So I blamed that on some bug related to the mesa libs.

I think I also tried disabling some low power states for my nvme ssd (saw someone referring to that on some forum), but that didn't help either.

On my many web searches about the subject I found many posts from a lot of people having similar issues. Different hardware and even different kernel versions...
One Ubuntu user said he had issues with an older version of the kernel (5.8 IIRC) and that upgrading to 5.11 seemed to solve the problem for him.
I even saw one guy with an AMD CPU complaining about the same symptoms!
Hard to say if any of these were related... I guess a lot of things could cause the system to freeze during idle that probably are not related to each other...

> > Anyway, I went to the BIOS config and I saw there was a "standby mode"
> > option that I had never noticed before (doesn't mean it wasn't there...).
> > It was set to "Legacy standby mode". I switched it to "modern standby
> > mode" and tested suspend again and it woke up successfully...
> > 
> > Since the crash symptoms were similar, I thought... let me give it a try
> > with a newer kernel... so I rebooted to kernel 5.10.84 (currently the
> > latest one in debian stable) and it hasn't crashed yet for 12 days... May
> > be just coincidence... maybe not...
> 
> You might be onto something here. It'll be interesting to see whether
> that "modern" standby mode works...

Here comes the latest news... I rebooted yesterday and switched that BIOS option back to Legacy S3 Standby (I think that's the correct name). Booted into the same kernel (5.10.84). It crashed during the night!!
So yeah I think I might be onto something here...

I haven't managed to get netconsole to work yet. Trying to figure that out now and see if I can catch something.

> Yap. By the sound of it, I'm sceptical netconsole'll catch anything from
> idle...

Me too, but worth giving it a shot...

Next thing will be switching again back to Modern Standby mode and see if that continues to work or if I was just on a lucky streak for 12 consecutive days...

I'll keep you posted of any further advances.
Comment 7 Dmitry 2022-01-15 17:36:11 UTC
> Ok, what about 5.9.16?
I can confirm that there was no issue on 5.9.16. So this should be some change made between 5.9.16 and 5.10.*


Also, as commented by Bruno above, I confirm that I managed to resolve this issue by disabling monitor suspend:

  xset -dpms
  xset s off 

No issue in more than 18 days since then.
Comment 8 Bruno Gravato 2022-01-16 11:47:58 UTC
It crashed again last night!

netconsole was running, but it didn't catch anything relevant.

I'm back to the "Modern standby" setting in the BIOS. Let's see what happens... I'll report back in the next few days.
Comment 9 Bruno Gravato 2022-01-17 17:24:07 UTC
Here comes the sad news...

It crashed today after 2h of idle time.

So yeah... Modern Standby isn't the solution.

I guess that earlier 12 day streak without crashing was just luck...

I will now try Dmitry's suggestion of disabling monitor suspend. That isn't ideal from a power-saving point of view, but I can minimize the impact by manually turning the monitors off when I'm AFK.

After that, the only thing I can think of is compiling each kernel version beyond 5.9.15 (or 5.9.16) and trying to figure out when this was introduced, as discussed before.

Meanwhile, there was another person posting on reddit about the same issue. He has a NUC8i7 (same as Dmitry). He said he tried old kernel 4.15, but it also crashed.
Comment 10 Bruno Gravato 2022-01-19 11:30:27 UTC
Last night got another crash, while using:

  xset -dpms
  xset s off

I did lock my session, so not sure if that overrode any of those settings.
But if I can't lock my session then this workaround wouldn't work for me...

Anyway back to square one!

Next up will be going through the tedious task of compiling each minor kernel version in 5.10 and trying to figure out what changed since 5.9.15 or 5.9.16.

I'll report back once I get that done. Might take a while...
Comment 11 Bruno Gravato 2022-01-21 11:41:02 UTC
I've got some interesting new developments!

I compiled kernel 5.10.1 using the .config file from my working kernel 5.9.15, leaving any new settings at default.
I ran it for a couple of days until it crashed this morning, but this time it was different!

I use lightdm and light-locker to lock my session when I'm AFK.
This morning when I got to the PC, it was still running... I entered my password to log in via lightdm and I got a black screen with a frozen mouse pointer in the center. This had never happened before...
Mouse pointer didn't move and I couldn't change to console tty with Ctrl-Alt-F1 either, BUT network was still working (responding to pings) and I was able to login remotely via ssh.

I checked journalctl and there were some kernel errors (posted below).
I tried to restart lightdm with systemctl restart lightdm but the system completely froze. Screen was still the same (black with mouse pointer frozen at the center), but no longer responding to pings or anything. The ssh connection was frozen too.

Judging by the logs I'm guessing 9:55:54 was when I attempted to login via lightdm (logs before that are just related to normal cron jobs, vpn connection, etc.)
At 9:58:38 I logged in remotely via ssh and shortly after that I tried to restart lightdm via systemctl. No logs of that or beyond until I rebooted.

This crash could of course be totally unrelated and be due to some other kernel bug in this version... Borislav, what do you think?
What should be my next move? For now I'm running 5.10.1 again to see if this happens again... Shall I instead move to 5.10.2? Or perhaps to 5.9.16?
Any other suggestion?

I've successfully locked my session and logged in via lightdm a few times, with this kernel, before and after this crash. So no reproducible pattern here...

Here's the complete journalctl log before the crash:

Jan 21 09:55:54 brunuc lightdm[1759871]: gkr-pam: unable to locate daemon control file
Jan 21 09:55:54 brunuc lightdm[1759871]: gkr-pam: stashed password to try later in open session
Jan 21 09:55:54 brunuc lightdm[1759871]: Error getting user list from org.freedesktop.Accounts: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.freedesktop.Accounts was not provided by any .service files
Jan 21 09:55:54 brunuc systemd[1]: Stopping Session c5 of user lightdm.
Jan 21 09:55:54 brunuc kernel: BUG: unable to handle page fault for address: ffffbbbb81fdfcf8
Jan 21 09:55:54 brunuc kernel: #PF: supervisor read access in kernel mode
Jan 21 09:55:54 brunuc kernel: #PF: error_code(0x0000) - not-present page
Jan 21 09:55:54 brunuc kernel: PGD 100000067 P4D 100000067 PUD 1001a9067 PMD 10588a067 PTE 0
Jan 21 09:55:54 brunuc kernel: Oops: 0000 [#2] SMP PTI
Jan 21 09:55:54 brunuc kernel: CPU: 0 PID: 1318 Comm: Xorg Tainted: G      D W   E     5.10.1 #1
Jan 21 09:55:54 brunuc kernel: Hardware name: Intel(R) Client Systems NUC8i5BEH/NUC8BEB, BIOS BECFL357.86A.0089.2021.0621.1343 06/21/2021
Jan 21 09:55:54 brunuc kernel: RIP: 0010:__wake_up_common+0x55/0x180
Jan 21 09:55:54 brunuc kernel: Code: 41 f6 01 04 0f 85 ab 00 00 00 48 8b 43 08 4c 8d 40 e8 48 8d 43 08 48 89 04 24 48 89 c6 49 8d 40 18 48 39 c6 0f 84 e2 00 00 00 <49> 8b 40 18 89 6c 24 14 31 ed 4c 8d 60 e8 41 8b 18 f6 c3 04 75 59
Jan 21 09:55:54 brunuc kernel: RSP: 0018:ffffbbbb8095fbb0 EFLAGS: 00010083
Jan 21 09:55:54 brunuc kernel: RAX: ffffbbbb81fdfcf8 RBX: ffff963651155800 RCX: 0000000000000001
Jan 21 09:55:54 brunuc kernel: RDX: 0000000000000001 RSI: ffff963651155808 RDI: ffff963651155800
Jan 21 09:55:54 brunuc kernel: RBP: 0000000000000001 R08: ffffbbbb81fdfce0 R09: ffffbbbb8095fc00
Jan 21 09:55:54 brunuc kernel: R10: 0000000000000000 R11: ffff963736cc0480 R12: 0000000000000001
Jan 21 09:55:54 brunuc kernel: R13: 0000000000000001 R14: 0000000000000001 R15: 00000000000000c3
Jan 21 09:55:54 brunuc kernel: FS:  00007f8e157b7a40(0000) GS:ffff9639b0e00000(0000) knlGS:0000000000000000
Jan 21 09:55:54 brunuc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 21 09:55:54 brunuc kernel: CR2: ffffbbbb81fdfcf8 CR3: 000000010555e003 CR4: 00000000003706f0
Jan 21 09:55:54 brunuc kernel: Call Trace:
Jan 21 09:55:54 brunuc kernel:  __wake_up_common_lock+0x7c/0xc0
Jan 21 09:55:54 brunuc kernel:  sock_def_readable+0x37/0x70
Jan 21 09:55:54 brunuc kernel:  unix_stream_sendmsg+0x1de/0x4d0
Jan 21 09:55:54 brunuc kernel:  sock_sendmsg+0x5e/0x60
Jan 21 09:55:54 brunuc kernel:  sock_write_iter+0x97/0x100
Jan 21 09:55:54 brunuc kernel:  do_iter_readv_writev+0x152/0x1b0
Jan 21 09:55:54 brunuc kernel:  do_iter_write+0x7c/0x1b0
Jan 21 09:55:54 brunuc kernel:  vfs_writev+0xa4/0x140
Jan 21 09:55:54 brunuc kernel:  ? __schedule+0x28a/0x870
Jan 21 09:55:54 brunuc kernel:  do_writev+0xde/0x110
Jan 21 09:55:54 brunuc kernel:  do_syscall_64+0x33/0x80
Jan 21 09:55:54 brunuc kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 21 09:55:54 brunuc kernel: RIP: 0033:0x7f8e15c22ddd
Jan 21 09:55:54 brunuc kernel: Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 ba ff f8 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 ee ff f8 ff 48
Jan 21 09:55:54 brunuc kernel: RSP: 002b:00007ffe09495560 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
Jan 21 09:55:54 brunuc kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8e15c22ddd
Jan 21 09:55:54 brunuc kernel: RDX: 0000000000000001 RSI: 00007ffe09495850 RDI: 0000000000000030
Jan 21 09:55:54 brunuc kernel: RBP: 00005584f8934050 R08: 0000000000000000 R09: 0000000000000001
Jan 21 09:55:54 brunuc kernel: R10: 0000000000000020 R11: 0000000000000293 R12: 0000000000000001
Jan 21 09:55:54 brunuc kernel: R13: 00007ffe09495850 R14: 0000000000000000 R15: 00005584f902ba20
Jan 21 09:55:54 brunuc kernel: Modules linked in: ctr(E) ccm(E) rfcomm(E) tun(E) wireguard(E) libchacha20poly1305(E) chacha_x86_64(E) poly1305_x86_64(E) ip6_udp_tunnel(E) udp_tunnel(E) libblake2s(E) blake2s_x86_64(E) libblake2s_generic(E)>
Jan 21 09:55:54 brunuc kernel:  snd_soc_skl(E) snd_hda_codec_hdmi(E) bluetooth(E) snd_soc_hdac_hda(E) snd_hda_ext_core(E) snd_soc_sst_ipc(E) jitterentropy_rng(E) snd_soc_sst_dsp(E) snd_soc_acpi_intel_match(E) snd_soc_acpi(E) snd_hda_codec>
Jan 21 09:55:54 brunuc kernel:  rng_core(E) evdev(E) acpi_pad(E) intel_pmc_core(E) acpi_tad(E) msr(E) drivetemp(E) parport_pc(E) ppdev(E) lp(E) parport(E) fuse(E) configfs(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc32c_>
Jan 21 09:55:54 brunuc kernel: CR2: ffffbbbb81fdfcf8
Jan 21 09:55:54 brunuc kernel: ---[ end trace 9d7cd3ee3a0e588c ]---
Jan 21 09:55:54 brunuc kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x193/0x1d0
Jan 21 09:55:54 brunuc kernel: Code: ff ff ff c6 47 01 00 e9 21 ff ff ff c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 80 c9 02 00 48 03 04 f5 00 e9 d5 b0 <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
Jan 21 09:55:54 brunuc kernel: RSP: 0018:ffffbbbb81fdfa98 EFLAGS: 00010082
Jan 21 09:55:54 brunuc kernel: RAX: ffffffffb1361980 RBX: ffffbbbb81fdfca0 RCX: 0000000000140000
Jan 21 09:55:54 brunuc kernel: RDX: ffff9639b0f2c980 RSI: 000000000000000f RDI: ffff963642ccfa08
Jan 21 09:55:54 brunuc kernel: RBP: ffff963642ccfa08 R08: 0000000000140000 R09: ffffffffb0e4a2b8
Jan 21 09:55:54 brunuc kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
Jan 21 09:55:54 brunuc kernel: R13: ffffbbbb81fdfc90 R14: 0000000000000001 R15: ffff9637cf59c400
Jan 21 09:55:54 brunuc kernel: FS:  00007f8e157b7a40(0000) GS:ffff9639b0e00000(0000) knlGS:0000000000000000
Jan 21 09:55:54 brunuc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 21 09:55:54 brunuc kernel: CR2: ffffbbbb81fdfcf8 CR3: 000000010555e003 CR4: 00000000003706f0
Jan 21 09:55:54 brunuc kernel: efivars: get_next_variable: status=8000000000000015
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Stopping timed out. Killing.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Killing process 1759784 (lightdm) with signal SIGKILL.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Killing process 1759824 (slick-greeter) with signal SIGKILL.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Killing process 1759834 (gdbus) with signal SIGKILL.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Killing process 1759847 (n/a) with signal SIGKILL.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Killing process 1759848 (n/a) with signal SIGKILL.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Killing process 1759849 (n/a) with signal SIGKILL.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Killing process 1759850 (n/a) with signal SIGKILL.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Failed with result 'timeout'.
Jan 21 09:57:24 brunuc systemd[1]: Stopped Session c5 of user lightdm.
Jan 21 09:57:24 brunuc systemd[1]: session-c5.scope: Consumed 1min 24.015s CPU time.
Jan 21 09:57:24 brunuc systemd-logind[531]: Removed session c5.
Jan 21 09:57:24 brunuc bluetoothd[912]: Endpoint unregistered: sender=:1.272 path=/MediaEndpoint/A2DPSink/sbc
Jan 21 09:57:24 brunuc bluetoothd[912]: Endpoint unregistered: sender=:1.272 path=/MediaEndpoint/A2DPSource/sbc
Jan 21 09:57:24 brunuc systemd[1759788]: pulseaudio.service: Succeeded.
Jan 21 09:57:34 brunuc systemd[1]: Stopping User Manager for UID 116...
Jan 21 09:57:34 brunuc systemd[1759788]: Stopped target Main User Target.
Jan 21 09:57:34 brunuc systemd[1759788]: Stopping Accessibility services bus...
Jan 21 09:57:34 brunuc gvfsd[1759857]: A connection to the bus can't be made
Jan 21 09:57:34 brunuc systemd[1759788]: Stopping D-Bus User Message Bus...
Jan 21 09:57:34 brunuc systemd[1759788]: Stopping Virtual filesystem service...
Jan 21 09:57:34 brunuc systemd[1759788]: at-spi-dbus-bus.service: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Stopped Accessibility services bus.
Jan 21 09:57:34 brunuc systemd[1759788]: dbus.service: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Stopped D-Bus User Message Bus.
Jan 21 09:57:34 brunuc systemd[1222]: run-user-116-gvfs.mount: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: run-user-116-gvfs.mount: Succeeded.
Jan 21 09:57:34 brunuc systemd[1]: run-user-116-gvfs.mount: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: gvfs-daemon.service: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Stopped Virtual filesystem service.
Jan 21 09:57:34 brunuc systemd[1759788]: Stopped target Basic System.
Jan 21 09:57:34 brunuc systemd[1759788]: Stopped target Paths.
Jan 21 09:57:34 brunuc systemd[1759788]: Stopped target Sockets.
Jan 21 09:57:34 brunuc systemd[1759788]: Stopped target Timers.
Jan 21 09:57:34 brunuc systemd[1759788]: dbus.socket: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Closed D-Bus User Message Bus Socket.
Jan 21 09:57:34 brunuc systemd[1759788]: dirmngr.socket: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Closed GnuPG network certificate management daemon.
Jan 21 09:57:34 brunuc systemd[1759788]: gpg-agent-browser.socket: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 21 09:57:34 brunuc systemd[1759788]: gpg-agent-extra.socket: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jan 21 09:57:34 brunuc systemd[1759788]: gpg-agent-ssh.socket: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jan 21 09:57:34 brunuc systemd[1759788]: gpg-agent.socket: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Closed GnuPG cryptographic agent and passphrase cache.
Jan 21 09:57:34 brunuc systemd[1759788]: pk-debconf-helper.socket: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Closed debconf communication socket.
Jan 21 09:57:34 brunuc systemd[1759788]: pulseaudio.socket: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Closed Sound System.
Jan 21 09:57:34 brunuc systemd[1759788]: Removed slice User Application Slice.
Jan 21 09:57:34 brunuc systemd[1759788]: Reached target Shutdown.
Jan 21 09:57:34 brunuc systemd[1759788]: systemd-exit.service: Succeeded.
Jan 21 09:57:34 brunuc systemd[1759788]: Finished Exit the Session.
Jan 21 09:57:34 brunuc systemd[1759788]: Reached target Exit the Session.
Jan 21 09:57:34 brunuc systemd[1]: user@116.service: Succeeded.
Jan 21 09:57:34 brunuc systemd[1]: Stopped User Manager for UID 116.
Jan 21 09:57:34 brunuc systemd[1]: Stopping User Runtime Directory /run/user/116...
Jan 21 09:57:34 brunuc systemd[1222]: run-user-116.mount: Succeeded.
Jan 21 09:57:34 brunuc systemd[1]: run-user-116.mount: Succeeded.
Jan 21 09:57:34 brunuc systemd[1]: user-runtime-dir@116.service: Succeeded.
Jan 21 09:57:34 brunuc systemd[1]: Stopped User Runtime Directory /run/user/116.
Jan 21 09:57:34 brunuc systemd[1]: Removed slice User Slice of UID 116.
Jan 21 09:57:34 brunuc systemd[1]: user-116.slice: Consumed 1min 24.362s CPU time.
Jan 21 09:58:38 brunuc sshd[2500970]: Accepted password for bruno from 192.168.1.30 port 45166 ssh2
Jan 21 09:58:38 brunuc sshd[2500970]: pam_unix(sshd:session): session opened for user bruno(uid=1000) by (uid=0)
Jan 21 09:58:38 brunuc systemd-logind[531]: New session 368 of user bruno.
Jan 21 09:58:38 brunuc systemd[1]: Started Session 368 of user bruno.
Jan 21 09:58:43 brunuc sudo[2500991]:    bruno : TTY=pts/7 ; PWD=/root ; USER=root ; COMMAND=/bin/bash
Jan 21 09:58:43 brunuc sudo[2500991]: pam_unix(sudo:session): session opened for user root(uid=0) by bruno(uid=1000)
Comment 12 Bruno Gravato 2022-01-23 11:14:54 UTC
I got another crash on kernel 5.10.1. This one was the "usual" way, i.e. it just froze during the night. No logs, nothing.

So next step will be building 5.9.16 and trying that one for a few days.
Comment 13 Bruno Gravato 2022-01-28 12:52:37 UTC
Still too early to be absolutely sure, but so far no crashes on 5.9.16.

I just realized there's a 5.10 release prior to 5.10.1, which I didn't try, but the difference between 5.10 and 5.10.1 is quite minimal, so I'd assume the change that causes trouble happened between 5.9.16 and 5.10.

This doesn't help much though, because the differences between 5.9.16 and 5.10 are quite massive!
So not even sure where to start...

Any tips?
Comment 14 Dmitry 2022-01-30 18:33:31 UTC
I went back to double-check my working configuration, and it turns out that I am using `max_cstate=1` in combination with those monitor settings:

  xset -dpms
  xset s off 

Mentioning this for completeness of possible reasons why my current setup works without issues.
Comment 15 Borislav Petkov 2022-02-07 13:44:40 UTC
(In reply to Bruno Gravato from comment #11)
> Jan 21 09:55:54 brunuc kernel: BUG: unable to handle page fault for address: ffffbbbb81fdfcf8
> Jan 21 09:55:54 brunuc kernel: #PF: supervisor read access in kernel mode
> Jan 21 09:55:54 brunuc kernel: #PF: error_code(0x0000) - not-present page
> Jan 21 09:55:54 brunuc kernel: PGD 100000067 P4D 100000067 PUD 1001a9067 PMD 10588a067 PTE 0
> Jan 21 09:55:54 brunuc kernel: Oops: 0000 [#2] SMP PTI
> Jan 21 09:55:54 brunuc kernel: CPU: 0 PID: 1318 Comm: Xorg Tainted: G      D W   E     5.10.1 #1

Hmm, if that's the whole splat, then your machine must've hit a warning
or a previous oops because this one looks like a follow-up one. Can you
send me full dmesg from booting on that machine, privately is fine too.

> Jan 21 09:55:54 brunuc kernel: Hardware name: Intel(R) Client Systems
> NUC8i5BEH/NUC8BEB, BIOS BECFL357.86A.0089.2021.0621.1343 06/21/2021
> Jan 21 09:55:54 brunuc kernel: RIP: 0010:__wake_up_common+0x55/0x180

That rIP is weird, looks like a corruption of sorts.

> Jan 21 09:55:54 brunuc kernel: Code: 41 f6 01 04 0f 85 ab 00 00 00 48 8b 43
> 08 4c 8d 40 e8 48 8d 43 08 48 89 04 24 48 89 c6 49 8d 40 18 48 39 c6 0f 84 e2
> 00 00 00 <49> 8b 40 18 89 6c 24 14 31 ed 4c 8d 60 e8 41 8b 18 f6 c3 04 75 59

  1a:   48 89 c6                mov    %rax,%rsi
  1d:   49 8d 40 18             lea    0x18(%r8),%rax
  21:   48 39 c6                cmp    %rax,%rsi
  24:   0f 84 e2 00 00 00       je     0x10c
  2a:*  49 8b 40 18             mov    0x18(%r8),%rax           <-- trapping instruction
  2e:   89 6c 24 14             mov    %ebp,0x14(%rsp)
  32:   31 ed                   xor    %ebp,%ebp
  34:   4c 8d 60 e8             lea    -0x18(%rax),%r12

and if I try to correlate it with my kernel here:

ffffffff810b2ddf:       48 89 c6                mov    %rax,%rsi
ffffffff810b2de2:       48 89 44 24 08          mov    %rax,0x8(%rsp)
ffffffff810b2de7:       49 8d 40 18             lea    0x18(%r8),%rax
ffffffff810b2deb:       48 39 c6                cmp    %rax,%rsi
ffffffff810b2dee:       0f 84 e4 00 00 00       je     ffffffff810b2ed8 <__wake_up_common+0x138>
ffffffff810b2df4:       49 8b 40 18             mov    0x18(%r8),%rax				<--- trap
ffffffff810b2df8:       89 6c 24 14             mov    %ebp,0x14(%rsp)
ffffffff810b2dfc:       c7 04 24 00 00 00 00    movl   $0x0,(%rsp)
ffffffff810b2e03:       48 8d 58 e8             lea    -0x18(%rax),%rbx

which looks like:

# kernel/sched/wait.c:83: 	if (&curr->entry == &wq_head->head)
	leaq	24(%r8), %rax	#, tmp122
# kernel/sched/wait.c:83: 	if (&curr->entry == &wq_head->head)
	cmpq	%rax, %rsi	# tmp122, _5
	je	.L82	#,
# kernel/sched/wait.c:86: 	list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
	movq	24(%r8), %rax	# curr_18->entry.next, tmp145					<--- trapping instruction
	movl	%ebp, 20(%rsp)	# nr_exclusive, %sfp
# kernel/sched/wait.c:71: 	int cnt = 0;
	movl	$0, (%rsp)	#, %sfp
# kernel/sched/wait.c:86: 	list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
	leaq	-24(%rax), %rbx	#, next
.L81:

%r8 contains that curr pointer: 0xffffbbbb81fdfce0 which is 0x18 away
from the address you're getting the page fault at.
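
The arithmetic can be checked quickly. This is only a sanity check using the values quoted in this thread; `effective_address` is a hypothetical helper name, and the 0x18 displacement is the one from the trapping `mov 0x18(%r8),%rax`:

```python
# Values quoted in this thread.
R8 = 0xFFFFBBBB81FDFCE0          # curr pointer held in %r8
FAULT_ADDR = 0xFFFFBBBB81FDFCF8  # address from the "unable to handle page fault" line
ENTRY_NEXT_OFFSET = 0x18         # displacement in `mov 0x18(%r8),%rax`


def effective_address(base: int, disp: int) -> int:
    """Base + displacement, truncated to 64 bits like the CPU would."""
    return (base + disp) & 0xFFFFFFFFFFFFFFFF


# The load through %r8 lands exactly on the faulting address.
assert effective_address(R8, ENTRY_NEXT_OFFSET) == FAULT_ADDR
print(hex(effective_address(R8, ENTRY_NEXT_OFFSET)))
```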

Which looks like a list corruption to me.

But this should not normally happen because we will be seeing it left
and right. Or maybe 5.10-stable is missing a fix...
Comment 16 Bruno Gravato 2022-02-07 13:56:35 UTC
Thank you for your reply.

I'll see if I still have that full log available and email it to you.

Although my guess is that that's probably an unrelated bug, perhaps some early 5.10 bug. I had never seen that before. It happened while I was logging in, not during idle.

After that I got crashes during idle (with 5.10.1) with nothing in the logs, as usual.

I've now been running 5.9.16 for over a week with no crashes.
Comment 17 Bruno Gravato 2022-06-23 23:06:36 UTC
Unfortunately I haven't had the opportunity to start bisecting kernel 5.10.0 to see if I can get to any conclusion on what may have changed...

Earlier this week I tried kernel 5.18.2 and it still crashes...

Now here's a new piece of information... I currently have the NUC's power adapter connected to a power meter on the wall socket to monitor its power consumption.

The power consumption measured is constantly changing. During idle periods it usually varies between multiple values in the 5-11W range. During normal use (as desktop) with light use it varies mostly between 11W and 20W (sometimes between 20W and 30W). During heavy loads it goes over 50W or even 60W.

When it crashed it was drawing a very constant and steady load. I didn't write down the number, but I think it was 22.6W (I'm sure it was 20-something and I stared at it for a while and it didn't change a tenth!).

This is significantly more than it usually draws during idle periods and the fact that it "froze" at a very fixed power draw I guess confirms that it's really "frozen", right?

I'm now doing another experiment: I'm running a resource monitor (btop) on a terminal window, that constantly draws some (text-mode) graphics, to make sure there's some constant activity going on and I'm also running a ping at 1 min. intervals to my router and writing the output to disk with tee, also to keep some frequent activity on the disk.
I didn't really notice any significant increase in power consumption while keeping this light activity, so if this is enough to prevent it from crashing I'd be OK with this workaround...

It survived the first night... If it holds I'll report back in a couple of weeks... (or earlier if it doesn't)
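
The "periodic write to disk" part of this experiment could also be done with a small heartbeat script instead of ping plus tee. This is only a sketch of the idea, not what was actually run here: the function names, the file path and the 60-second default are all assumptions. Its main use is to leave a last-alive timestamp on disk so the time of a freeze can be narrowed down afterwards.

```python
import os
import time
from datetime import datetime, timezone
from typing import Optional


def write_heartbeat(path: str) -> str:
    """Append one UTC timestamp line and ask the kernel to flush it to disk."""
    line = datetime.now(timezone.utc).isoformat() + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # request that the data reaches storage
    return line


def run(path: str, interval_s: float = 60.0, beats: Optional[int] = None) -> None:
    """Write a heartbeat every interval_s seconds; beats=None loops forever."""
    count = 0
    while beats is None or count < beats:
        write_heartbeat(path)
        count += 1
        time.sleep(interval_s)
```

Calling `run("/var/tmp/heartbeat.log")` (a hypothetical path) would keep appending until the machine stops; the last line then bounds the freeze time to within one interval.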
Comment 18 Bruno Gravato 2022-07-01 14:20:50 UTC
Following up on my last comment...

After a few days it finally crashed during the night...
This time the power meter was showing only 6.3W power consumption (absolutely steady again). The NUC fan was still spinning (but now it's summer so it's normal the fan is spinning all the time).

I wasn't running btop at the time, but I was still running that ping at 60 sec interval and outputting to a file. So that didn't do the trick.

The last log entry in syslog was at about the same time as the last ping stored in the file. As always, nothing unusual in the logs.

It didn't crash while btop was running... Maybe just a coincidence or maybe btop generates enough activity in the CPU that prevents it from going into the idle state that causes the crash? I'll keep testing.

I now have a USB audio interface that doesn't work properly on 5.9.16, so I need to sort this out ASAP and move to a newer kernel, so I'll be putting more effort into this. More updates soon...

Anyway, I think the most interesting fact from this last crash is that it froze consuming significantly less power than the other time... 22.6W to 6.3W is a very significant difference! It seems like whatever it's doing when it freezes, it just stays in some strange infinite loop...
Comment 19 Reny 2022-09-14 20:29:43 UTC
Random question since I have the same hardware and recently experienced a similar problem:

Is your machine perhaps using Vulkan for something?

When I decided to try mpv --vo=gpu-next (which uses Vulkan GPU backend) the kernel would panic every 12-36 hours, always while idling with the display powered off/DP disconnected.

Netconsole output:

 mce: CPUs not responding to MCE broadcast (may include false positives): 0-1,3-5,7
 mce: CPUs not responding to MCE broadcast (may include false positives): 0-1,3-5,7
 Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
 Shutting down cpus with NMI
 Kernel Offset: disabled

Perhaps not related to your problems, but this is the closest match to my problem I could find anywhere. I had not had any similar problem with this machine since I got it, and it has been stable again since I stopped using Vulkan.
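
For anyone reading the CPU list in the netconsole output above, the range notation can be expanded mechanically. A small sketch, with `expand_cpu_list` as a hypothetical helper name:

```python
from typing import List


def expand_cpu_list(spec: str) -> List[int]:
    """Expand a kernel-style CPU list such as "0-1,3-5,7" into CPU numbers."""
    cpus: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.extend(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus


print(expand_cpu_list("0-1,3-5,7"))  # [0, 1, 3, 4, 5, 7]
```

That is six CPUs listed as not responding, with 2 and 6 absent from the list.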
Comment 20 Bruno Gravato 2022-09-15 03:50:39 UTC
(In reply to Reny from comment #19)
> Random question since I have the same hardware and recently experienced a
> similar problem:
> 
> Is your machine perhaps using Vulkan for something?
> 

No, I don't use Vulkan and I never did... Sorry :-)

I still have the problem, but I haven't had the time for any further tests since my last post.

I've been on 5.18.xx kernels lately and it's still crashing randomly during idle periods... Because this is my daily driver I now often put it into standby/sleep mode during the night and other long idle periods.

For the last few days I've been leaving it on though, but leaving my USB audio interface on and connected via USB, to see if the power drawn by it via USB is enough to prevent it from going into that idle state that makes it crash...

BTW, I don't get any kernel panics in the logs, there's absolutely nothing in the logs prior to the crash.
Comment 21 Bruno Gravato 2022-11-27 21:30:04 UTC
I finally found a way of reproducing these freezes, using "systemctl hybrid-sleep".

I can put the system into suspend with "systemctl suspend", the power led in the NUC will blink (as expected). Pressing a mouse button or power button will wake the system and successfully resume from suspend.

If I put it to hibernate with "systemctl hibernate" it will successfully go into hibernation (RAM contents saved to swap; NUC completely turns off). Then upon boot it will successfully resume from hibernation.

With hybrid-sleep I believe it does a mix of both... the contents of the RAM are put into the swap partition (as in hibernate), but the system will go into suspend state instead.
This is what happens:
- hibernation data successfully stored
- system appears to go into suspend (power led blinks as in normal suspend mode)
- on button press the power led goes bright but it fails to resume, no signal reaches the display (hdmi connection), doesn't react to keyboard, doesn't reply to pings on the network, etc... all the same symptoms as in the "usual" freezes during idle periods
- only way of getting out of this state is to press and hold the power button to force a "hard" poweroff
- then on boot it successfully resumes from the hibernation data saved during systemctl hybrid-sleep
- looking at journalctl after this shows the expected messages before going into suspend and after resuming from hibernation, but there are no logs from the failed attempt to wake from suspend

Regarding power consumption measurements (from the wall, using a power meter), here's an interesting fact:
- after "normal" suspend it consumes 2.1W
- after hybrid-sleep it consumes only 1.5W

In either case shouldn't the system go into S3 mode and shouldn't the power consumption be the same?
Could this be related to the fact that in the BIOS setting I have it set for "modern standby" instead of "legacy S3 mode"? Are there different standby/S3 modes?
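
On the question of different standby modes: Linux exposes the suspend variants it can use in /sys/power/mem_sleep, with the currently selected one in square brackets (`s2idle` is suspend-to-idle, `deep` is S3 suspend-to-RAM). A minimal sketch for reading it follows; the helper names are hypothetical and the sample string is illustrative, not taken from this machine:

```python
from typing import List, Tuple


def parse_mem_sleep(text: str) -> Tuple[str, List[str]]:
    """Return (active_mode, all_modes) from the contents of /sys/power/mem_sleep."""
    modes: List[str] = []
    active = ""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the selected mode
            active = token
        modes.append(token)
    return active, modes


def read_mem_sleep(path: str = "/sys/power/mem_sleep") -> Tuple[str, List[str]]:
    """Read and parse the sysfs file on a running Linux system."""
    with open(path, encoding="ascii") as f:
        return parse_mem_sleep(f.read())


# Illustrative input only.
print(parse_mem_sleep("s2idle [deep]\n"))  # ('deep', ['s2idle', 'deep'])
```

Comparing the selected mode under each BIOS setting might show whether "modern standby" and "legacy S3 mode" end up in different states, which could be one explanation for the different power readings.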

When it freezes upon trying to wake from hybrid-sleep the power consumption kind of stalls at around 19.2W which suggests it's "doing something", but I let it run for a few minutes and nothing changes... seems like it's stuck in some loop.

Ah! Almost forgot... I was able to reproduce this equally in both kernel 5.9.16 (the one that never crashed before) and the more recent kernel 6.0.3.

I don't know if these hybrid-sleep freezes are related to the ones I experience during idle periods, but the symptoms are pretty much the same, which reinforces that the problem must be related somehow to low power states.

systemctl hybrid-sleep works without issues on another computer (Thinkpad X230), running the same Debian version.

Any thoughts on why suspend and hibernate work fine, but it fails to wake from hybrid-sleep? Anyone out there following this thread with a NUC8i5BEH can try systemctl hybrid-sleep and see if it resumes successfully and report back?
