Bug 219364 - Stalls unless C-states disabled - Intel Lunarlake - Dell XPS 13 9350, HP OmniBook Ultra Flip Laptop 14
Status: RESOLVED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_idle
Hardware: All Linux
Importance: P3 normal
Assignee: Len Brown
URL:
Keywords:
Duplicates: 219477
Depends on:
Blocks:
 
Reported: 2024-10-09 01:13 UTC by AceLan Kao
Modified: 2024-11-19 09:17 UTC
CC List: 8 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg on DC (162.38 KB, text/plain), 2024-10-09 01:13 UTC, AceLan Kao
cpuidle/state* log (2.70 KB, text/plain), 2024-10-16 03:40 UTC, AceLan Kao
dmesg max_cstate=1 (219.86 KB, text/plain), 2024-10-16 03:41 UTC, AceLan Kao
dmesg max_cstate=2 (208.50 KB, text/plain), 2024-10-16 03:41 UTC, AceLan Kao
test X86_BUG_MONITOR patch (1.16 KB, patch), 2024-11-07 20:36 UTC, Len Brown
test X86_BUG_MONITOR patch for Linux 6.10 and earlier (1.38 KB, patch), 2024-11-08 01:18 UTC, Len Brown
standalone program to detect migration stalls on LNL (9.90 KB, text/x-csrc), 2024-11-08 01:39 UTC, Len Brown
test X86_BUG_MONITOR patch (1.35 KB, patch), 2024-11-12 05:51 UTC, Len Brown

Description AceLan Kao 2024-10-09 01:13:49 UTC
Created attachment 306989 [details]
dmesg on DC

Kernel: v6.12-rc2

In the dmesg you can see some abnormally long delays during boot.

For example (note the gap of about 4 seconds between 0.630696 and 4.617497):
[    0.613464] pci_scan_bridge_extend: pci 0000:00:1c.0: scanning [bus 71-71] behind bridge, pass 1
[    0.613471] pci_scan_child_bus_extend: pci_bus 0000:00: bus scan returning with max=71
[    0.630696] ACPI: \_SB_.PEPD: Duplicate LPS0 _DSM functions (mask: 0x1)
[    4.617497] Low-power S0 idle used by default for system suspend
[    4.648486] ACPI: EC: interrupt unblocked

and 
[    8.911398] int3472-discrete INT3472:00: cannot find GPIO chip INTC10B5:00, deferring
[    8.939251] int3472-discrete INT3472:00: GPIO type 0x12 unknown; the sensor may not work
[    8.940853] int3472-discrete INT3472:00: cannot find GPIO chip INTC10B5:00, deferring
[   10.210778] xe 0000:00:02.0: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[   11.078838] Creating 1 MTD partitions on "0000:00:1f.5":
[   11.078854] 0x000000000000-0x000004000000 : "BIOS"

and
[   11.900024] NET: Registered PF_QIPCRTR protocol family
[   12.996969] iwlwifi 0000:00:14.3: RFIm is deactivated, reason = 4
[   13.120034] iwlwifi 0000:00:14.3: Registered PHC clock: iwlwifi-PTP, with index: 0
[   14.034314] systemd-journald[365]: /var/log/journal/a8e0c8bb11df4dac858c05d4c8ab7e6b/user-1000.journal: Journal file uses a different sequence number ID, rotating.
[   14.351742] kauditd_printk_skb: 160 callbacks suppressed

The problem is less severe on AC power, but you can still feel it while using the machine after boot.
Comment 1 AceLan Kao 2024-10-09 01:18:27 UTC
I recorded a video showing the strange pause while moving the cursor.
https://people.canonical.com/~acelan/bugs/bz-219364/

The strange pause also happens when typing on the built-in keyboard. After a pause, the last character is sometimes repeated many times, and sometimes nothing happens at all.
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-10-10 11:01:12 UTC
Did this happen with earlier kernels like 6.10 or 6.11 as well?
Comment 3 AceLan Kao 2024-10-11 03:09:53 UTC
I reproduced this issue with the v6.11 kernel.
This is a new SoC, so I didn't try older kernels.
Comment 4 Neo Wong 2024-10-11 04:40:23 UTC
1. Forcing the CPU to stay in C0 works (run the second command in the root shell):
sudo su -
cat >/dev/cpu_dma_latency <(echo -e -n "\x0\x0\x0\x0" ; sleep inf)

2. Disabling C-states on the kernel command line works:
cpuidle.off=1 intel_idle.max_cstate=0
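
The C0 workaround in item 1 relies on the PM QoS interface: a latency request written to /dev/cpu_dma_latency stays in effect only while the file descriptor remains open. A minimal sketch of the same idea with an explicit time limit follows. The function name hold_cpu_latency and its two arguments are made up for illustration and do not appear anywhere in this report.

```shell
# hold_cpu_latency DEVICE SECONDS   (hypothetical helper, not from this report)
# Writes a 32-bit zero to DEVICE and keeps the descriptor open for SECONDS.
# For /dev/cpu_dma_latency this requests zero wake-up latency, which keeps
# the CPUs out of deep idle states until the descriptor is closed.
hold_cpu_latency() {
    dev=${1:-/dev/cpu_dma_latency}
    secs=${2:-60}
    (
        exec 3> "$dev" || exit 1
        printf '\000\000\000\000' >&3   # four zero bytes = s32 value 0
        sleep "$secs"                   # request stays active while fd 3 is open
    )                                   # leaving the subshell closes fd 3
}
```

Run as root, for example `hold_cpu_latency /dev/cpu_dma_latency 300`, to test for five minutes without leaving a background `sleep inf` behind.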
Comment 5 Len Brown 2024-10-15 15:40:23 UTC
Please share the C-states on this machine out of the box:

grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*

Can you find the deepest state that results in no symptoms
by booting with descending values of N in intel_idle.max_cstate=N
(or by disabling states via /sys/devices/system/cpu/cpu*/cpuidle/state*/disable)?

Do you see the same symptoms if you simply boot with maxcpus=1 ?
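
To compare runs with different max_cstate values, it can help to print only the fields that matter for each idle state. The sketch below does that for one CPU. The function name list_cstates is hypothetical; name, latency, and disable are the standard cpuidle sysfs attributes.

```shell
# list_cstates [CPUIDLE_DIR]   (hypothetical helper name)
# Prints one line per idle state: index, name, exit latency in microseconds,
# and whether the state is currently disabled.
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for s in "$dir"/state*; do
        [ -r "$s/name" ] || continue    # skip if the glob matched nothing
        printf '%s name=%s latency_us=%s disable=%s\n' \
            "${s##*/}" "$(cat "$s/name")" "$(cat "$s/latency")" "$(cat "$s/disable")"
    done
}
```

Calling `list_cstates` with no argument reads cpu0; pass another CPU's cpuidle directory to inspect a different core.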
Comment 6 Len Brown 2024-10-15 15:42:18 UTC
Do the boot-time delays also go away with C-states off,
(please show dmesg for boot with C-states off)
or does disabling C-states only help with the interactive issues?
Comment 7 Srinivas Pandruvada 2024-10-15 16:09:08 UTC
Question about the delays while using the system:

Are these noticeable delays only an issue in DC mode?

Did you try the latest upstream 6.12-rc* kernel?

Does the following setting make any difference in DC mode?
/sys/devices/system/cpu/cpu*/power/energy_perf_bias = 7
Comment 8 AceLan Kao 2024-10-16 03:40:43 UTC
Created attachment 307012 [details]
cpuidle/state* log

I see only a very minor pause with `intel_idle.max_cstate=2`, while with `intel_idle.max_cstate=3` the pause while moving the cursor is clearly visible.

Boot time is also affected by max_cstate: the larger the C-state number, the longer the boot seems to take.

The delay is easier to observe on DC, but I can also observe it on AC after a few reboots.
/sys/devices/system/cpu/cpu*/power/energy_perf_bias was 8; setting it to 7 helps, and the pause becomes very minor.
Comment 9 AceLan Kao 2024-10-16 03:41:11 UTC
Created attachment 307013 [details]
dmesg max_cstate=1
Comment 10 AceLan Kao 2024-10-16 03:41:33 UTC
Created attachment 307014 [details]
dmesg max_cstate=2
Comment 11 Srinivas Pandruvada 2024-10-16 14:45:50 UTC
(In reply to AceLan Kao from comment #8)
> Created attachment 307012 [details]
> cpuidle/state* log
> 
> I can see pretty minor pause with `intel_idle.max_cstate=2`, and with
> `intel_idle.max_cstate=3` you can feel and see the pause while moving the
> cursor very clear.
> 
> And the boot up time also affects by the max_cstate, looks like it takes
> more time to boot up with the bigger cstate number.
> 
> The delay is easier to observe with DC, but I can observe it with AC with
> some reboot.
> /sys/devices/system/cpu/cpu*/power/energy_perf_bias were 8, and set them to
> 7 helps, the pause becomes very minor.

Who set this value to 8? Is the Dell BIOS doing this? Unless the BIOS changes it, the value should be 6.
Comment 12 Neo Wong 2024-10-16 16:14:48 UTC
I think those two sentences were separate points. From my observation, EPB is 8 on DC and 6 on AC.
Comment 13 Srinivas Pandruvada 2024-10-16 17:00:47 UTC
But who set those EPBs? The OS didn't.
Comment 14 Srinivas Pandruvada 2024-10-17 13:58:51 UTC
I got the answer. This is done by the Linux power profiles daemon.
Comment 15 Srinivas Pandruvada 2024-10-17 15:08:33 UTC
Please check whether EPB=7 fixes the issue.
Comment 16 AceLan Kao 2024-10-21 02:50:29 UTC
No, EPB=7 doesn't fix the issue.
You can still observe the strange pause while moving the cursor, or the abnormal keyboard behavior, from time to time, especially after letting the system idle for a while.
Comment 17 hugh chao 2024-10-22 08:58:04 UTC
I also set EPB=7 on my side:
echo 7 | tee /sys/devices/system/cpu/cpu*/power/energy_perf_bias

I can still observe the lag.
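
Since a power profile daemon may rewrite EPB when the power source changes (see comments 12 to 14), it is worth reading the values back before drawing conclusions from a test. A minimal sketch follows; the function name show_epb is hypothetical.

```shell
# show_epb [CPU_BASE]   (hypothetical helper name)
# Prints "cpuN <value>" for every CPU that exposes energy_perf_bias.
show_epb() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/power/energy_perf_bias; do
        [ -r "$f" ] || continue         # skip if the glob matched nothing
        cpu=${f#"$base"/}               # strip the base directory
        printf '%s %s\n' "${cpu%%/*}" "$(cat "$f")"
    done
}
```

Running `show_epb` right after the `tee` command, and again after the lag has been observed, shows whether the value was still 7 at that point.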
Comment 18 hugh chao 2024-10-22 12:05:04 UTC
To summarize all symptoms, I observe system lag when:
- using the touchpad
- using the onboard keyboard
- typing over ssh
- playing music (wav, mp3) via apps (Rhythmbox, Audacious, video...)
Comment 19 Kevin Cheng 2024-10-22 13:12:06 UTC
Please check whether any of the following fixes the issue.
1. Disable ACPI C3
for i in {0..7}; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state3/disable; done

2. Offline the E-cores
for i in {4..7}; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done

3. Change the Power Mode in Settings -> Power -> Power Mode to Performance.
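
The loop in step 1 hard-codes CPUs 0 to 7. A sketch of the same operation that walks whatever CPUs are present, and that can also write 0 to re-enable the state, might look like this. The function name set_cstate_disable is hypothetical; which index corresponds to C3 should be checked against the state's name file on the machine under test.

```shell
# set_cstate_disable STATE_INDEX VALUE [CPU_BASE]   (hypothetical helper name)
# Writes VALUE (1 = disable, 0 = enable) to the disable file of idle state
# STATE_INDEX on every CPU found under CPU_BASE. Needs root on a real system.
set_cstate_disable() {
    idx=$1
    val=$2
    base=${3:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/cpuidle/state"$idx"/disable; do
        [ -w "$f" ] || continue         # skip missing or read-only entries
        echo "$val" > "$f"
    done
}
```

For example, `set_cstate_disable 3 1` disables state3 everywhere and `set_cstate_disable 3 0` restores it, which makes the before/after comparison quick to repeat.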
Comment 20 hugh chao 2024-10-22 14:27:20 UTC
I tested with the touchpad while playing music at the same time:

1. Made sure I could observe the issue.
1.1 Disable ACPI C3 --> it works
for i in {0..7}; do echo 1 > /sys/devices/system/cpu/cpu$i/cpuidle/state3/disable;done

2. Rebooted and made sure I could observe the issue.
2.2 Offline the E-cores --> it works
for i in {4..7}; do echo 0 > /sys/devices/system/cpu/cpu$i/online;done

3. Rebooted and made sure I could observe the issue.
3.3 Power Mode set to Performance --> this does not work
Comment 21 AceLan Kao 2024-10-23 02:46:07 UTC
Verified the tests below with AC connected (EPB = 6).

> 1. Disable ACPI C3
> for i in {0..7}; do echo 1 > 
> /sys/devices/system/cpu/cpu$i/cpuidle/state3/disable;done
This works.
Writing 0 back to those files quickly reproduces the issue.

> 2. Offline E Core
> for i in {4..7}; do echo 0 > /sys/devices/system/cpu/cpu$i/online;done
This works, too.
Writing 1 back to those files quickly reproduces the issue.

> 3. Change the Power Mode in Settings -> Power -> Power Mode to Performance.
(EPB = 0)
The issue still occurs. This does not improve the situation, and sometimes the pause is as severe as on DC power.
Comment 22 Neo Wong 2024-10-28 14:55:57 UTC
Can you try Linux 6.11.0-8-generic? I know you mentioned in comment #3 that 6.11 reproduces the issue, but I would like to double-check whether this specific version does.
Comment 23 AceLan Kao 2024-10-28 16:01:43 UTC
Yes, the issue is hard to reproduce on 6.11.0-8, but you can still sometimes observe the pause while moving the cursor.
The symptoms are much milder than with other kernels, but I wouldn't say this kernel is free of the issue.
Comment 24 Neo Wong 2024-10-28 16:11:11 UTC
Can you check whether the system can enter C3 or S0ix?
Comment 25 Neo Wong 2024-10-28 17:36:39 UTC
BTW, where can I get the mainline v6.12-rc2 kernel? It seems builds are only available up to 6.11: https://kernel.ubuntu.com/mainline/
Comment 26 AceLan Kao 2024-10-29 09:12:30 UTC
Yes, the Package C2, C6, and C10 counters increase in /sys/kernel/debug/pmc_core/package_cstate_show.

And the slp_s0_residency_usec counter increases when suspended.

We're moving our servers, so you may have to build the kernel yourself.
Comment 27 Len Brown 2024-11-07 20:36:20 UTC
Created attachment 307180 [details]
test X86_BUG_MONITOR patch

Please try this patch.
Comment 28 Len Brown 2024-11-08 01:18:59 UTC
Created attachment 307184 [details]
test X86_BUG_MONITOR patch for Linux 6.10 and earlier

Same logical patch as above,
but this version applies to the syntax of Linux 6.10 and earlier,
while the version above applies to the syntax of 6.11 and later.
Comment 29 Len Brown 2024-11-08 01:29:27 UTC
*** Bug 219477 has been marked as a duplicate of this bug. ***
Comment 30 Len Brown 2024-11-08 01:39:04 UTC
Created attachment 307185 [details]
standalone program to detect migration stalls on LNL

The stand-alone program can detect stalls.

It uses sched_setaffinity(2) to migrate itself from cpu(4-7) to cpu(0-3),
using rdtsc to measure how long the migration takes.

The program will complain about migrations that take more than 1ms.
Sometimes it will detect stalls over 1000ms.

By default it migrates 1000 times, once every 250 ms, running for 250 sec.
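
For readers without a compiler at hand, the same idea can be roughly approximated from a shell. This is only a coarse sketch and not the attached C program: it re-pins the shell with taskset(1) instead of calling sched_setaffinity(2) directly, and it times with GNU `date +%s%N` instead of rdtsc, so each measurement includes fork and exec overhead. For that reason the default threshold here is an assumed 50 ms rather than the 1 ms used by the attachment. The function names time_repin and detect_stalls are hypothetical.

```shell
# time_repin CPU_LIST   (hypothetical helper)
# Re-pins the calling shell to CPU_LIST and stores the elapsed wall-clock
# time in microseconds in the global variable elapsed_us.
# Call it directly, not inside $(...): a command substitution would time a
# subshell instead of the process that is being moved.
time_repin() {
    t0=$(date +%s%N)                    # GNU date: nanoseconds since the epoch
    taskset -cp "$1" "$$" > /dev/null || return 1
    t1=$(date +%s%N)
    elapsed_us=$(( (t1 - t0) / 1000 ))
}

# detect_stalls [COUNT] [INTERVAL_S] [LIMIT_US]   (hypothetical helper)
# Pins the shell to CPUs 4-7, sleeps, then times the move to CPUs 0-3 and
# reports every move that took longer than LIMIT_US microseconds.
detect_stalls() {
    count=${1:-1000}
    interval=${2:-0.25}
    limit_us=${3:-50000}                # assumed threshold, looser than 1 ms
    i=0
    while [ "$i" -lt "$count" ]; do
        time_repin 4-7 || return 1
        sleep "$interval"               # let the other CPUs go idle
        time_repin 0-3 || return 1
        if [ "$elapsed_us" -gt "$limit_us" ]; then
            echo "iteration $i: move to CPUs 0-3 took $elapsed_us us"
        fi
        i=$((i + 1))
    done
}
```

Running `detect_stalls` with no arguments mirrors the defaults described above (1000 moves, 250 ms apart). Because of the process-spawning overhead it can only flag the long stalls, not the borderline ones, so the attached program remains the better tool when it can be built.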
Comment 31 Len Brown 2024-11-12 05:51:11 UTC
Created attachment 307208 [details]
test X86_BUG_MONITOR patch

This patch applies to Linux-6.11 and later
Comment 32 AceLan Kao 2024-11-19 00:42:28 UTC
Verified with the mainline v6.12 and Ubuntu 6.11 kernels, the patch fixes the issue.
Thanks.
Comment 33 Mingcong Bai 2024-11-19 09:17:43 UTC
(In reply to Len Brown from comment #31)
> Created attachment 307208 [details]
> test X86_BUG_MONITOR patch
> 
> This patch applies to Linux-6.11 and later

Fix verified on an Asus Vivobook S 2024 OLED. Thanks!
