200133 – Random Freezes Unless Disable Intel P-States intel_pstate=disable(turbo disabled) (Dell Precision 5510, Skylake i5-6440HQ)

Status:	CLOSED DOCUMENTED

Alias:	None

Product:	Power Management
Classification:	Unclassified
Component:	intel_pstate
Hardware:	Intel Linux

Importance:	P1 normal
Assignee:	Srinivas Pandruvada

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-06-19 03:15 UTC by carbonchauvinist
Modified:	2021-05-24 13:44 UTC
CC List:	6 users

See Also:
Kernel Version:	4.17.2-1-ARCH
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
hwinfo (1.32 MB, text/plain) 2018-06-19 03:15 UTC, carbonchauvinist	Details
hwinfo-correct (1.31 MB, text/plain) 2018-06-19 03:19 UTC, carbonchauvinist	Details
lspci -nnv (12.47 KB, text/plain) 2018-06-19 03:21 UTC, carbonchauvinist	Details
cpuinfo (4.94 KB, text/plain) 2018-06-19 03:23 UTC, carbonchauvinist	Details
4.18-rc2 journal without "intel_pstate=disable" (84.16 KB, text/plain) 2018-06-30 16:52 UTC, carbonchauvinist	Details
4.18-rc2 multi-user.target journal without "intel_pstate=disable" (85.82 KB, text/plain) 2018-06-30 17:06 UTC, carbonchauvinist	Details
acpi.out (905.37 KB, text/plain) 2018-06-30 17:57 UTC, carbonchauvinist	Details
turbostat acpi-cpufreq (11.16 KB, text/plain) 2018-06-30 18:01 UTC, carbonchauvinist	Details
turbostat intel_pstate (13.33 KB, text/plain) 2018-06-30 18:07 UTC, carbonchauvinist	Details
4.17.2-1-ARCH journal with "dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc" (256.83 KB, text/plain) 2018-06-30 19:53 UTC, carbonchauvinist	Details
turbostat --debug acpi-freq with increased system load (108.64 KB, text/plain) 2018-06-30 20:26 UTC, carbonchauvinist	Details
journal 4.17.3-1-ARCH with "dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc" and disabled turbo (277.01 KB, text/plain) 2018-07-01 13:19 UTC, carbonchauvinist	Details
turbostat --debug intel_pstate w/ edited msr, disabled turbo, increased load (74.76 KB, text/plain) 2018-07-01 14:30 UTC, carbonchauvinist	Details
dmesg acpi-cpufreq at boot intel_pstate=disable dyndbg="file processor_perflib.c +p" (68.72 KB, text/plain) 2018-07-01 18:56 UTC, carbonchauvinist	Details
dmesg acpi-cpufreq increased load intel_pstate=disable dyndbg="file processor_perflib.c +p" (68.81 KB, text/plain) 2018-07-01 19:00 UTC, carbonchauvinist	Details
acpi-cpufreq schedutil governor turbostat single thread stress (23.42 KB, text/plain) 2018-07-01 22:29 UTC, carbonchauvinist	Details
acpi-cpufreq performance governor turbostat single thread stress (25.90 KB, text/plain) 2018-07-01 22:35 UTC, carbonchauvinist	Details
dmesg acpi-cpufreq performance default governor (68.11 KB, text/plain) 2018-07-03 02:42 UTC, carbonchauvinist	Details
turbostat --debug acpi-freq performance default increased load (30.87 KB, text/plain) 2018-07-03 02:47 UTC, carbonchauvinist	Details
dmesg intel-cpufreq powersave default governor boot intel_pstate=passive dyndbg="file processor_perflib.c +p" (68.03 KB, text/plain) 2018-07-04 00:07 UTC, carbonchauvinist	Details
journal intel-cpufreq powersave default governor boot intel_pstate=passive dyndbg="file processor_perflib.c +p" (327.56 KB, text/plain) 2018-07-04 00:12 UTC, carbonchauvinist	Details
boot kernel command line turbo disable (1.08 KB, patch) 2018-07-05 22:29 UTC, Srinivas Pandruvada	Details \| Diff
disable_boot_turbo_option (1.07 KB, patch) 2018-07-06 17:01 UTC, Srinivas Pandruvada	Details \| Diff
dmesg turbo_disable_patch 4.18.0-rc3 intel_pstate=no_turbo (69.01 KB, text/plain) 2018-07-06 17:03 UTC, carbonchauvinist	Details
journalctl -k 4.18.0-rc3 intel_pstate=no_turbo freeze (76.79 KB, text/plain) 2018-07-06 20:54 UTC, carbonchauvinist	Details
journalctl -k 4.17.3-1 (76.79 KB, text/plain) 2018-07-06 21:27 UTC, carbonchauvinist	Details
journalctl -k 4.17.3-1 patch intel_pstate=no_turbo - freeze when turbo enabled (83.87 KB, text/plain) 2018-07-06 21:32 UTC, carbonchauvinist	Details
turbostat 4.17.3-1 intel_pstate=no_turbo (increased load included) (14.73 KB, text/plain) 2018-07-08 15:36 UTC, carbonchauvinist	Details
turbostat 4.17.3-1 intel_pstate=no_turbo - enabled turbo (increased load included) (12.79 KB, text/plain) 2018-07-08 15:41 UTC, carbonchauvinist	Details
turbostat 4.17.3-1 intel_pstate=no_turbo cpufreq.off=1 custom MSR values with increased load (16.78 KB, text/plain) 2018-07-08 15:56 UTC, carbonchauvinist	Details
**CORRECTED turbostat 4.17.3-1 intel_pstate=no_turbo cpufreq.off=1 custom MSR values with increased load CORRECTED** (26.21 KB, text/plain) 2018-07-08 18:27 UTC, carbonchauvinist	Details
4.18.13.-1 freeze on boot (4.06 MB, image/jpeg) 2018-07-11 22:22 UTC, carbonchauvinist	Details
intel_pstate=disable kernel 5.7.8-arch1-1 kernel null pointer dereference #pf supervisor read access (107.76 KB, text/plain) 2020-07-14 00:58 UTC, carbonchauvinist	Details
intel_pstate enabled; turbo enabled; graphic disabled; still kernel panic (2.44 MB, image/jpeg) 2020-07-15 13:32 UTC, carbonchauvinist	Details
5.13.0-rc2-1-mainline - mce when booting with turbo enabled (546.83 KB, image/jpeg) 2021-05-23 02:07 UTC, carbonchauvinist	Details

Description carbonchauvinist 2018-06-19 03:15:14 UTC

Created attachment 276677 [details]
hwinfo

Dell Precision 5510 with the latest available BIOS (1.7.0) running the latest Arch stable kernel. Skylake i5-6440HQ with integrated HD 530 only.

Symptoms are random freezes: hard locks that cause the Caps Lock key to blink and leave a hard power reset as the only option. Journalctl does not appear to capture any logs when this happens.

I'm using i915 Intel driver (not modesetting) and have it loading early in initramfs (passed as module in mkinitcpio.conf) and sometimes the freezes occur during boot.

Disabling intel_pstate and falling back on acpi-cpufreq resolves the issues for me and I experience no more freezing. I've tried intel_pstate=passive and intel_pstate=no_hwp to no effect; only completely disabling it works.

$ cat /proc/cmdline
initrd=\intel-ucode.img initrd=\initramfs-linux.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet acpi_backlight=vendor intel_pstate=disable

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
acpi-cpufreq
acpi-cpufreq
acpi-cpufreq
acpi-cpufreq
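
A minimal sketch of the same check condensed into a per-driver count, which is easier to read on machines with many CPUs. The function name scaling_driver_summary and its optional root argument are hypothetical and not part of the original report.

```shell
#!/bin/sh
# Print each distinct cpufreq scaling driver and how many CPUs use it.
# $1 is the CPU sysfs root; it defaults to the real location.
scaling_driver_summary() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_driver; do
        # Skip the unexpanded glob when no cpufreq driver is loaded.
        [ -r "$f" ] && cat "$f"
    done | sort | uniq -c | awk '{ print $2 ": " $1 " CPU(s)" }'
}

scaling_driver_summary
```

With the four acpi-cpufreq entries shown above, this would print a single line: acpi-cpufreq: 4 CPU(s).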

I would prefer to be able to use intel_pstate driver though if at all possible.

I've attached hwinfo, lspci, cpuinfo and can provide any additional information which may be needed.

Comment 1 carbonchauvinist 2018-06-19 03:19:43 UTC

Created attachment 276679 [details]
hwinfo-correct

corrected hwinfo

Comment 2 carbonchauvinist 2018-06-19 03:21:26 UTC

Created attachment 276681 [details]
lspci -nnv

Comment 3 carbonchauvinist 2018-06-19 03:23:47 UTC

Created attachment 276683 [details]
cpuinfo

Comment 4 Doug Smythies 2018-06-29 14:44:02 UTC

It seems that Intel has not updated the maintainer (the assignee) for the intel_pstate driver, it hasn't been Kristen for a few years now. (i.e. I wonder if the current maintainers have seen this.)

Have you tried kernel 4.17.3? or 4.18-rc2? and does the problem still occur with them?

What is the average frequency of occurrence of the issue? i.e. hours, days, weeks?

Comment 5 carbonchauvinist 2018-06-30 13:17:00 UTC

I just tried 4.18-rc2 and still have issues, though manifested differently. Mainly, now without the "intel_pstate=disable", I can barely get through a boot. I just had it hard-lock three times in a row during boot, before getting to SDDM, while trying to boot 4.18-rc2. On the fourth try I was at least able to get to SDDM, but it hung immediately. Here's the log from that instance.

https://ptpb.pw/kTnB

Interestingly, loading my current kernel, 4.17.2-1-ARCH, without the "intel_pstate=disable" I also have issues with it hard-locking during boot; around ~70% of the time it will hang during boot.

When I am able to boot successfully, in the limited trials I've done so far, I seem to be able to run for long periods of time without any hangs (~2-3 hours tested so far). Previously I would have had hangs with frequencies of minutes in some instances. 

The issue now with 4.17.2-1-ARCH, when I am able to get past the boot process without "intel_pstate=disable" enabled, is that I am unable to suspend properly. After running "systemctl suspend", my computer reboots when I try to resume.

Comment 6 carbonchauvinist 2018-06-30 13:46:16 UTC

Tried booting into multi-user.target instead of graphical.target with 4.18-rc2 without disabling intel_pstate and still got an almost immediate freeze when logging in, log pasted below:

https://ptpb.pw/pSKH

Comment 7 Doug Smythies 2018-06-30 14:29:18 UTC

You could try updating the microcode for your processor, as 0XC6 is the current version and you have 0XC2.

Your logs should be attached here, instead of an external reference. There seem to be ACPI issues in your log.

Was everything O.K. with an older kernel? (you might have to go back to 4.13 to test) And if yes, do you know how to bisect the kernel to isolate the problem commit?

I'll see if the current intel_pstate maintainers can have a look at this.

Comment 8 Srinivas Pandruvada 2018-06-30 16:35:23 UTC

First, always run with the latest BIOS.

Since you have the issue in HWP mode, where the driver doesn't do much other than activation, I think something else is triggering this. Depending on the configuration, acpi-cpufreq on such a system can't do much CPU frequency scaling, so it may not be triggering some other issues.

Since this system was previously loaded with Windows, we need to check whether it would have used HWP at all. Also, what is the configuration acpi-cpufreq can use?
So attach:
output of acpidump
#acpidump > acpi.out

I think you can still run for a few moments before lockup:
run
#turbostat --debug
with acpi-cpufreq and intel_pstate (default)

Comment 9 Srinivas Pandruvada 2018-06-30 16:48:45 UTC

Also try one more combination, by changing kernel command line

dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc
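
After rebooting, it can help to confirm that the extra parameters actually reached the kernel before collecting logs. This is a minimal sketch; cmdline_has is a hypothetical helper, and it only matches space-free parameters, so the quoted dyndbg value is not checked here.

```shell
#!/bin/sh
# Return success if the kernel command line in $1 contains the exact
# word $2. Only simple space-free parameters are handled.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null)
for p in intel_pstate=no_hwp intel_pstate=support_acpi_ppc; do
    if cmdline_has "$cmdline" "$p"; then
        echo "$p: present"
    else
        echo "$p: missing"
    fi
done
```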

Comment 10 carbonchauvinist 2018-06-30 16:52:22 UTC

Created attachment 277085 [details]
4.18-rc2 journal without "intel_pstate=disable"

Here's the log from booting 4.18-rc2 without the "intel_pstate=disable" parameter, took me four times to actually get through boot process and have SDDM load, only to immediately hang after entering my password (still at SDDM screen no desktop loaded).

Comment 11 carbonchauvinist 2018-06-30 17:06:25 UTC

Created attachment 277087 [details]
4.18-rc2 multi-user.target journal without "intel_pstate=disable"

Here's the journal from booting into systemd.unit=multi-user.target on 4.18-rc2 without the "intel_pstate=disable" parameter; booted on first try (though that may have just been lucky), but still immediately hung after entering my password.

Comment 12 carbonchauvinist 2018-06-30 17:36:37 UTC

@Doug Smythies

Interesting, I have enabled all of the steps required for microcode update and it seems to think I should be on 0xc2; see below:

$ dmesg | grep microcode
[    0.765652] microcode: sig=0x506e3, pf=0x20, revision=0xc2
[    0.765799] microcode: Microcode Update Driver: v2.2.

$ grep microcode /proc/cpuinfo
microcode       : 0xc2
microcode       : 0xc2
microcode       : 0xc2
microcode       : 0xc2
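
The comparison against the revision mentioned in comment 7 can be scripted. This is a sketch only: ucode_older_than is a hypothetical helper, and 0xc6 is the value quoted in this thread for this CPU, so it is an assumption for any other processor.

```shell
#!/bin/sh
# Succeed if microcode revision $1 is numerically lower than $2.
# Both are hex strings such as 0xc2; shell arithmetic parses the 0x prefix.
ucode_older_than() {
    [ $(( $1 )) -lt $(( $2 )) ]
}

# First "microcode : 0x.." line from /proc/cpuinfo (empty on non-x86 systems).
running=$(awk '/^microcode/ { print $3; exit }' /proc/cpuinfo 2>/dev/null)
expected=0xc6   # revision named in comment 7; an assumption elsewhere
if [ -n "$running" ] && ucode_older_than "$running" "$expected"; then
    echo "running $running is older than $expected"
fi
```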

$ bsdtar -Oxf /boot/intel-ucode.img | iucode_tool -tb -lS -
iucode_tool: system has processor(s) with signature 0x000506e3
microcode bundle 1: (stdin)
selected microcodes:
  001/160: sig 0x000506e3, pf_mask 0x36, 2017-11-16, rev 0x00c2, size 99328

$ pacman -Qi intel-ucode
Name            : intel-ucode
Version         : 20180425-1
Description     : Microcode update files for Intel CPUs
Architecture    : any
URL             : https://downloadcenter.intel.com/SearchResult.aspx?lang=eng&keyword=processor%20microcode%20data%20file
Licenses        : custom
Groups          : None
Provides        : None
Depends On      : None
Optional Deps   : None
Required By     : None
Optional For    : None
Conflicts With  : None
Replaces        : microcode_ctl
Installed Size  : 1639.00 KiB
Packager        : Christian Hesse <arch@eworm.de>
Build Date      : Mon 07 May 2018 05:11:33 PM EDT
Install Date    : Wed 09 May 2018 01:26:53 PM EDT
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature

I am running the latest BIOS from Dell, I do realize the ACPI seems to be buggy. Not sure what I can do about that.

$ sudo dmidecode|head -n 37
# dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.
91 structures occupying 5667 bytes.
Table at 0x000EAC00.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: Dell Inc.
        Version: 1.7.0
        Release Date: 02/23/2018
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 16 MB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                Smart battery is supported
                BIOS boot specification is supported
                Function key-initiated network boot is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 1.7


There was a point in time where I didn't have these issues, might have been 4.13, but can't be certain - I always put it off to look into later hoping newer kernels would address.

I know nothing about bisects but would be willing to look into what that entails.

Comment 13 carbonchauvinist 2018-06-30 17:56:10 UTC

@Srinivas Pandruvada

I'm uploading the files you requested now.

Please note this machine was originally purchased with Ubuntu and did not come installed with Windows. I have recently added an M.2 NVMe drive with a Windows install, but I don't think that is related to my issues.

I will try to boot with the additional parameters you suggested (dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc) and report back.

Comment 14 carbonchauvinist 2018-06-30 17:57:58 UTC

Created attachment 277089 [details]
acpi.out

@Srinivas Pandruvada

So attach:
output of acpidump
#acpidump > acpi.out

Comment 15 carbonchauvinist 2018-06-30 18:01:29 UTC

Created attachment 277091 [details]
turbostat acpi-cpufreq

@Srinivas Pandruvada 

run
#turbostat --debug
with acpi-cpufreq

Comment 16 carbonchauvinist 2018-06-30 18:07:45 UTC

Created attachment 277093 [details]
turbostat intel_pstate

@ Srinivas Pandruvada 
I think you can still run for few moments before lockup:
run
#turbostat --debug
with intel_pstate (default)

** Please note, I now seem to be able to run with intel_pstate enabled without getting hard-locks (at least in limited testing so far, lasting up to 3-4 hours, wherein before it would lock in minutes at times). 

However, now the issue is if I boot without disabling intel_pstate I will NOT make it through the boot process up to ~70% of the time. It will hang/hard-lock at random points throughout the boot before loading SDDM. Other times, even if SDDM loads it will hang prior to loading the desktop. If I make it to the desktop though - it appears to be okay.

The other issue when not disabling intel_pstate, the few times I actually make it to a desktop, is that after suspending, the computer reboots when attempting to resume.

Comment 17 Srinivas Pandruvada 2018-06-30 18:20:16 UTC

Pretty idle systems in both cases. Can you run something and collect turbostat:
- I want to see if you are able to scale frequency with acpi-cpufreq

I think you may be getting some thermal event during boot and resume, which acpi-cpufreq can handle. So try with the options I suggested, it may help and dump something in logs also.

Comment 18 carbonchauvinist 2018-06-30 19:53:14 UTC

Created attachment 277095 [details]
4.17.2-1-ARCH journal with "dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc"

@Srinivas Pandruvada

Here's a log from a boot with the parameters you instructed. Did get a hang during boot the first time, was able to make it past boot to desktop the second time. However, after putting the system in suspend, it hung when trying to resume.

Will get you better turbostat logs when using acpi-cpufreq here shortly.

Comment 19 carbonchauvinist 2018-06-30 20:26:20 UTC

Created attachment 277097 [details]
turbostat --debug acpi-freq with increased system load

@Srinivas Pandruvada

Here's a turbostat which hopefully is more useful?

Comment 20 Srinivas Pandruvada 2018-06-30 21:52:35 UTC

I see that acpi-cpufreq is basically running without turbo even with 98% load. 
So this may be preventing some thermal issues.
Also I see some thermal events.

When you are booted with intel_pstate, disable turbo:
#echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

and do suspend resume and check.

Comment 21 Srinivas Pandruvada 2018-07-01 02:33:05 UTC

Along with the above suggestion, also do
#rdmsr 0x1a0
Whatever you read, just set bit 38 of it.
For example, if you read 0x850089
then write
0x4000850089

#wrmsr 0x1a0 0x4000850089
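
The "set bit 38" step is a bitwise OR with 1 << 38, which can be computed rather than edited by hand. This is a minimal sketch; set_bit38 is a hypothetical helper name. For context, MSR 0x1a0 is IA32_MISC_ENABLE, and the Intel SDM describes bit 38 of that register as the turbo mode disable bit.

```shell
#!/bin/sh
# Print the given MSR value with bit 38 set, as hex.
set_bit38() {
    printf '0x%x\n' $(( $1 | (1 << 38) ))
}

set_bit38 0x850089   # prints 0x4000850089
```

Since rdmsr prints the value without a 0x prefix, one would pass "0x$(rdmsr 0x1a0)" to the helper and hand the result to wrmsr; both tools need root and the msr kernel module loaded.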

Comment 22 carbonchauvinist 2018-07-01 13:19:26 UTC

Created attachment 277103 [details]
journal 4.17.3-1-ARCH with "dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc" and disabled turbo

@Srinivas Pandruvada

Here's another journal from a recent boot. I booted with the additional kernel parameters you've instructed (now on 4.17.3-1-ARCH) and disabling the turbo as per your instructions. 

$ cat /etc/systemd/system/disable_turbo.service 
[Unit]
Description=disable turbo when on intel_pstate

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart = /home/ghost/scripts/disable_pstate_turbo.sh 

[Install]
WantedBy=multi-user.target


$ cat scripts/disable_pstate_turbo.sh 
#!/bin/sh
if [ -f /sys/devices/system/cpu/intel_pstate/no_turbo ]; then echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo && printf "intel_pstate turbo disabled" ; else printf "acpi-freq not intel_pstate - no turbo to disable"; fi

Again, with pstate enabled (even if hwp is disabled) once I make it past the boot process (which is no small feat) the system is reliable. I was able to suspend multiple times during this session and bring the computer back up. 

However, at the end of the night I suspended and while suspended plugged in the ac adapter to leave overnight to charge. This morning when I went to resume, I unplugged the ac adapter and tried to resume. The computer performed a reboot again instead of resume. 

It literally took me booting ten (10) times this morning before I was able to make it through a boot to my desktop.

I'll be happy to make those changes to the MSR you recommended. Can you please advise as to what that's actually changing though? I assume that's not a permanent change and would need to be done on each boot when testing? I tried looking up what that specific MSR is in the Intel® 64 and IA-32 Architectures Software Developer’s Manual without luck (obviously over my head).

Comment 23 carbonchauvinist 2018-07-01 14:30:04 UTC

Created attachment 277105 [details]
turbostat --debug intel_pstate w/ edited msr, disabled turbo, increased load

I made the edits to the msr you instructed. They did read as only "850089" at first, and after writing the new value this is what I currently read:

$ sudo rdmsr -a 0x1a0
4000850089
4000850089
4000850089
4000850089

Ran another turbostat with increased load in case that's helpful.

Comment 24 Srinivas Pandruvada 2018-07-01 15:00:31 UTC

Changing the MSR is not the solution. This basically proves the theory that the platform has a thermal issue during boot or resume. Disabling turbo helps here, hence you are able to resume. Once the system is on you can use thermald to control thermal, but during boot it won't help.

So we need to debug why acpi-cpufreq is not using turbo at all, either because of a bug or because it got some notification.

So can you boot with acpi-cpufreq, with intel_pstate disabled, using the following kernel parameters? Collect dmesg. Then run some workload and collect dmesg again.

intel_pstate=disable dyndbg="file processor_perflib.c +p"

Comment 25 carbonchauvinist 2018-07-01 18:52:30 UTC

I see; re: finding out why acpi-cpufreq is not using turbo at all may have had to do with TLP being installed? I've disabled TLP for the time being just in case. Please note the intel_pstate issues were present irrespective of having TLP installed and enabled.

I'll attach the dmesg output you requested here shortly.

Comment 26 carbonchauvinist 2018-07-01 18:56:23 UTC

Created attachment 277109 [details]
dmesg acpi-cpufreq at boot intel_pstate=disable dyndbg="file processor_perflib.c +p"

Boot with acpi-cpufreq with intel_pstate disabled with the following kernel parameters. Collect dmesg.

intel_pstate=disable dyndbg="file processor_perflib.c +p"

Comment 27 carbonchauvinist 2018-07-01 19:00:24 UTC

Created attachment 277111 [details]
dmesg acpi-cpufreq increased load intel_pstate=disable dyndbg="file processor_perflib.c +p"

Boot with acpi-cpufreq with intel_pstate disabled with the following kernel parameters ... Then run some workload and collect dmesg.

intel_pstate=disable dyndbg="file processor_perflib.c +p"

Comment 28 Srinivas Pandruvada 2018-07-01 21:16:13 UTC

I don't see any thermal notification to limit frequency, and the previous log showed that acpi-cpufreq didn't go turbo at all. That may be a bug. If we fix this, you will probably have the same issue with acpi-cpufreq.

You already checked that disabling turbo helped in intel-pstate during resume.
So this is somehow caused by system thermal.

To double-confirm, run both governors and check a single-thread workload like stress -c 1 and run turbostat.

I will see if I can get the same platform to check.

Comment 29 carbonchauvinist 2018-07-01 22:29:00 UTC

Created attachment 277117 [details]
acpi-cpufreq schedutil governor turbostat single thread stress

For double confirm, run both governors and check a single thread workload like stress -c 1 and run turbostat.

schedutil governor (default)

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors 
performance schedutil 
performance schedutil 
performance schedutil 
performance schedutil 

$  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
schedutil
schedutil
schedutil
schedutil

Comment 30 carbonchauvinist 2018-07-01 22:35:44 UTC

Created attachment 277119 [details]
acpi-cpufreq performance governor turbostat single thread stress

For double confirm, run both governors and check a single thread workload like stress -c 1 and run turbostat.

performance governor

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors 
performance schedutil 
performance schedutil 
performance schedutil 
performance schedutil 

$  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance
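
Switching between the two governors listed above can be done at runtime through sysfs, without rebuilding the kernel. This is a sketch under assumptions: set_governor is a hypothetical helper, and its optional second argument exists only so it can be tried against a scratch directory.

```shell
#!/bin/sh
# Write governor $1 to every CPU's scaling_governor under sysfs root $2
# (defaults to the real location; writing there needs root).
set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip files that are missing or not writable by this user.
        [ -w "$f" ] || continue
        echo "$gov" > "$f"
    done
}
```

Running set_governor performance as root, and later set_governor schedutil, would cover the two cases requested before each stress -c 1 and turbostat run.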

Comment 31 Srinivas Pandruvada 2018-07-01 23:51:27 UTC

I do see turbo with acpi-cpufreq. I don't know why it was missing in the previous log with 96%+ busy.
Let me check if I can find this platform to try.

Comment 32 carbonchauvinist 2018-07-02 03:05:01 UTC

I think TLP was the culprit for the discrepancy (see possibly relevant sections from /etc/default/tlp below that could explain); turbo or no-turbo may have depended on if I was running on AC or battery?

# Set energy performance hints (HWP) for Intel P-state governor:
#   performance, balance_performance, default, balance_power, power
# Values are given in order of increasing power saving.
# Note: Intel Skylake or newer CPU and Kernel >= 4.10 required.
CPU_HWP_ON_AC=balance_performance
CPU_HWP_ON_BAT=balance_power

# Minimize number of used CPU cores/hyper-threads under light load conditions:
#   0=disable, 1=enable.
SCHED_POWERSAVE_ON_AC=0
SCHED_POWERSAVE_ON_BAT=1

# Set CPU performance versus energy savings policy:
#   performance, balance-performance, default, balance-power, power.
# Values are given in order of increasing power saving.
# Requires kernel module msr and x86_energy_perf_policy from linux-tools.
ENERGY_PERF_POLICY_ON_AC=performance
ENERGY_PERF_POLICY_ON_BAT=power

Comment 33 Srinivas Pandruvada 2018-07-02 17:19:43 UTC

If you change the default acpi-cpufreq governor to performance at boot, do you still not have the issue? For this to be the default you need to change the kernel config:
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y.

Comment 34 Srinivas Pandruvada 2018-07-02 17:21:11 UTC

and of course rebuild the kernel after changing the kernel config.

Comment 35 Doug Smythies 2018-07-02 21:42:23 UTC

(In reply to carbonchauvinist from comment #12)
> @Doug Smythies
> 
> Interesting, I have enabled all of the steps required for microcode update
> and it seems to think I should be on 0xc2; see below:
> 

See here:

https://www.intel.com/content/dam/www/public/us/en/documents/sa00115-microcode-update-guidance.pdf

at the bottom of page 15.
But let's put that on hold for now.

(In reply to carbonchauvinist from comment #25)
> I see; re: finding out why acpi-cpufreq is not using turbo at all may have
> had to do with TLP being installed? I've disabled TLP for the time being
> just in case. Please note the intel_pstate issues were present irrespective
> of having TLP installed and enabled.

Then I suggest leaving TLP disabled and/or removed and out of it.

(In reply to carbonchauvinist from comment #32)
> I think TLP was the culprit for the discrepancy (see possibly relevant
> sections from /etc/default/tlp below that could explain); turbo or no-turbo
> may have depended on if I was running on AC or battery?

Again, suggest to take TLP out of it. I don't know about this particular Dell computer, but at least some force different operating conditions depending on the power adapter being plugged in or not, and even if the adapter is not a Dell adapter. These tests must be done under the exact same conditions between tests. 

(In reply to carbonchauvinist from comment #12)
> @Doug Smythies
> There was a point in time where I didn't have these issues, might have been
> 4.13, but can't be certain - I always put it off to look into later hoping
> newer kernels would address.

Myself (but I'd defer to Srinivas), I think that would be important to identify for certain.

Comment 36 carbonchauvinist 2018-07-02 23:31:25 UTC

(In reply to Doug Smythies from comment #35)
> See here:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/sa00115-microcode-update-guidance.pdf
> 
> at the bottom of page 15.
> But let's put that on hold for now.

Thanks for this, and for getting this to the right people to begin the process. I've been assured by folks on Arch IRC that I have the latest publicly available microcode for my CPU at this time.

> Then suggest to leave TLP disbaled and/or removed and out of it.
> Again, suggest to take TLP out of it. I don't know about this particular
> Dell computer, but at least some force different operating conditions
> depending on the power adapter being plugged in or not, and even if the
> adapter is not a Dell adapter. These tests must be done under the exact same
> conditions between tests. 

Of course, I've already disabled and removed TLP entirely from my system for the time being. I'm embarrassed I forgot to mention I'd installed it on my machine once I finally found a solution for my freezing issues (disabling intel_pstate) and didn't put the two together about the turbo discrepancies until much later than I should have.

> Myself (but I'd defer to Srinivas), I think that would be important to
> identify for certain.

I can look to try this out this upcoming weekend; I guess I can just download the old Linux versions from the Arch archive. Maybe just download all >= 4.10 and then continue to test them until I run into the problem?

Comment 37 carbonchauvinist 2018-07-03 02:37:54 UTC

(In reply to Srinivas Pandruvada from comment #33)
> If you change default acpi-cpufreq governor to performance at boot, do you
> still don't have issue? For this to be default you need to change the kernel
> config
> CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y.

I'm not sure which issue you're referring to here specifically. I think not hitting turbo on acpi-cpufreq, which I was experiencing earlier when troubleshooting, was due to TLP and/or being on battery and not AC.

I've now removed TLP completely for sake of testing and I'm running all tests connected to AC moving forward (unless asked otherwise) and apologize for the noise.

I've built a version of 4.17.3-1-ARCH with performance default governor. I tried one boot without disabling intel_pstate and it hung in the boot process, so that issue is still present.

I then booted this new kernel with the intel_pstate=disable dyndbg="file processor_perflib.c +p" parameters. I'll attach the dmesg and turbostat shortly, but it appeared to go into Turbo mode when increasing load.

Is there a better way to capture the turbostat output? I've been using "sudo turbostat --debug > turbostat.out 2>&1" but it seems to truncate (or not capture all) the output?

Comment 38 carbonchauvinist 2018-07-03 02:42:42 UTC

Created attachment 277139 [details]
dmesg acpi-cpufreq performance default governor

dmesg at boot with acpi-freq and performance as rebuilt kernel default governor

Comment 39 carbonchauvinist 2018-07-03 02:47:25 UTC

Created attachment 277141 [details]
turbostat --debug acpi-freq performance default increased load

turbostat with increased load with performance as default governor

only additional line to dmesg from boot after increased load was:

$ diff dmesg.out dmesg_after.out 
1013a1014
> [  181.557943] capability: warning: `turbostat' uses 32-bit capabilities (legacy support in use)

Comment 40 superm1 2018-07-03 02:47:54 UTC

Created attachment 277143 [details]
attachment-15565-0.html

Hello,

I will be OOO between July 3 and July 5. Expect delayed response.

Comment 41 Srinivas Pandruvada 2018-07-03 21:30:43 UTC

So even with the acpi-cpufreq performance governor as default, the log didn't show any PPC notifications, which we would expect if there were a thermal issue.
Even if the intel-pstate default powersave behaved aggressively, it still can't be more aggressive than the acpi-cpufreq performance governor. So it is a very strange issue.

The only additional thing acpi-cpufreq would have done is to inform SMM. But acpidump shows acpi_gbl_FADT.pstate_control as 0 (" P-State Control : 00"), so this call will simply return.

One thing suggested by Len is to build with CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE=y and the other CONFIG_CPU_FREQ_DEFAULT_GOV_* as "not set",
boot with the kernel command line "intel_pstate=passive",
and see if it can boot.

Comment 42 carbonchauvinist 2018-07-04 00:03:17 UTC

Okay. So rebuilt as instructed:

# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y
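
One way to double-check which default governor a kernel was built with is to filter its config for the single CONFIG_CPU_FREQ_DEFAULT_GOV_* option that is set to y. A minimal sketch, assuming the config text is piped in on stdin (for example from /proc/config.gz, which only exists when the kernel was built with CONFIG_IKCONFIG_PROC); the function name default_governor is made up for illustration:

```shell
# Print the governor named by the CONFIG_CPU_FREQ_DEFAULT_GOV_*=y line
# of a kernel config read from stdin. Prints nothing if none is set.
default_governor() {
    sed -n 's/^CONFIG_CPU_FREQ_DEFAULT_GOV_\([A-Z]*\)=y$/\1/p' |
        tr '[:upper:]' '[:lower:]'
}

# Example (assumes /proc/config.gz is available on the running kernel):
#   zcat /proc/config.gz | default_governor
```

For the config listed above this should print "powersave".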


I was just able to boot with intel_pstate=passive literally 10 times in a row (connected to AC) which is very encouraging.

$ journalctl --list-boots
...
 -10 8b4e176c56c74d6ab0b94a48e2b7a859 Tue 2018-07-03 19:35:33 EDT—Tue 2018-07-03 19:36:00 EDT
  -9 e3e6ffc7fa3744269454e83e74a90ee7 Tue 2018-07-03 19:36:41 EDT—Tue 2018-07-03 19:37:26 EDT
  -8 7b54ba3436334ce483f0ac804bbccf05 Tue 2018-07-03 19:38:08 EDT—Tue 2018-07-03 19:38:36 EDT
  -7 ea41045d5cc444349d36e0141489ce1e Tue 2018-07-03 19:39:19 EDT—Tue 2018-07-03 19:39:56 EDT
  -6 7fc80a4270324dd8ada6dd8ea66dfcea Tue 2018-07-03 19:40:36 EDT—Tue 2018-07-03 19:41:11 EDT
  -5 400827dda7b041318b28040977670d7f Tue 2018-07-03 19:41:48 EDT—Tue 2018-07-03 19:42:15 EDT
  -4 066d23f2d5f04712af19fe67159c591e Tue 2018-07-03 19:42:57 EDT—Tue 2018-07-03 19:43:26 EDT
  -3 688d37a799b0404ea78b19cc0fca64ec Tue 2018-07-03 19:44:06 EDT—Tue 2018-07-03 19:45:47 EDT
  -2 bcda0e20d07541acb2d955b1ef4619d2 Tue 2018-07-03 19:46:23 EDT—Tue 2018-07-03 19:46:42 EDT
  -1 31e39c62d9ad48eda5f8f30094797221 Tue 2018-07-03 19:50:15 EDT—Tue 2018-07-03 19:50:54 EDT
   0 eab59bb0608148bcacf37c5355512571 Tue 2018-07-03 19:51:29 EDT—Tue 2018-07-03 19:52:09 EDT

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
powersave
powersave
powersave
powersave

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver 
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq

$ cat /proc/cmdline 
initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=passive dyndbg="file processor_perflib.c +p"

I'll post a dmesg and journal from the last boot as well in case it's useful.

Comment 43 carbonchauvinist 2018-07-04 00:07:39 UTC

Created attachment 277161 [details]
dmesg intel-cpufreq powersave default governor boot intel_pstate=passive dyndbg="file processor_perflib.c +p"

this is the dmesg from booting 4.17.3-1 with intel-cpufreq (intel_pstate=passive dyndbg="file processor_perflib.c +p") and default governor set to powersave

Comment 44 carbonchauvinist 2018-07-04 00:12:18 UTC

Created attachment 277163 [details]
journal intel-cpufreq powersave default governor boot intel_pstate=passive dyndbg="file processor_perflib.c +p"

this is the journal from booting 4.17.3-1 with intel-cpufreq (intel_pstate=passive dyndbg="file processor_perflib.c +p") and default governor set to powersave

Comment 45 Srinivas Pandruvada 2018-07-05 20:29:00 UTC

This confirms that the platform has thermal issues during boot (we usually see higher CPU temperatures during boot because of the extra processing), so it only tolerates a less aggressive governor during boot. (But why acpi-cpufreq with the performance governor as default has no issue is a mystery.)

Since you have changed the base thermals by adding or replacing an NVMe drive, I don't know whether that changed the thermal model the system was built for.

I am waiting for Mario from Dell to confirm if he sees the same issue on this model.

I can build a patch that disables turbo during boot and allows it to be enabled later from sysfs.

Comment 46 Doug Smythies 2018-07-05 20:45:40 UTC

(In reply to Srinivas Pandruvada from comment #45)
> This confirms that platform has thermal issues during boot (usually during
> boot because of extra processing, we will see higher CPU temperature).

I think a race condition is also a possibility, where forcing the slower clock rate during boot kept the condition from occurring.

Note also that one of the OP's turbostat logs seemed to indicate that full load on all CPUs was O.K., with the processor package temperature only getting to 85 degrees.

Comment 47 Srinivas Pandruvada 2018-07-05 22:29:06 UTC

Created attachment 277207 [details]
boot kernel command line turbo disable

@Doug
That is also possible. Going higher during boot than at runtime is not unusual because of uncore initialization, which gets poked by i915 and other PCI devices.

Attached a patch.
Boot with the following added to the kernel command line, and change it later from the intel_pstate sysfs (/sys/devices/system/cpu/intel_pstate/no_turbo):

intel_pstate=no_turbo

Comment 48 carbonchauvinist 2018-07-06 00:06:31 UTC

(In reply to Srinivas Pandruvada from comment #45)
> Since you have changed base thermal by adding addition or replace NVMe,
> don't know if it changed the thermal model, the system was built for.

Just a small note, I only added the m.2 NVMe to this system in May and importantly I was also having the issues prior to the addition. 

$ lsblk
NAME             FSTYPE      TYPE  MOUNTPOINT
sda                          disk  
├─sda1           vfat        part  /boot
└─sda2           crypto_LUKS part  
  └─lvm          LVM2_member crypt 
    ├─lvmvg-root ext4        lvm   /
    ├─lvmvg-var  ext4        lvm   /var
    └─lvmvg-home ext4        lvm   /home
nvme0n1                      disk  
├─nvme0n1p1      ntfs        part  
├─nvme0n1p2      vfat        part  
├─nvme0n1p3                  part  
└─nvme0n1p4      ntfs        part

I run Linux from the Samsung EVO 850 SATA drive and don't access the M.2 drive at all when booted, if that matters.
 
Also since changing to default powersave governor I've even been able to run with intel_pstate fully enabled now for close to two days of uptime. I've been suspending and resuming without issue (all connected to AC). Of course performance is noticeably slower.

$ uptime
 19:46:03 up 1 day, 22:03,  1 user,  load average: 0.47, 0.53, 0.51

$ cat /proc/cmdline 
initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet dyndbg="file processor_perflib.c +p"

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver 
intel_pstate

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave

> I can build a patch to allow disable turbo during boot and allow later from
> sysfs, to be enabled.

Will try your patch and report back.

Comment 49 carbonchauvinist 2018-07-06 01:45:55 UTC

I'm not sure if I did something wrong or not. I've added your patch to vanilla kernel, but get an error in one out of the three hunks when trying to build:

$ makepkg -f
...
patching file drivers/cpufreq/intel_pstate.c
Hunk #1 FAILED at 297.
Hunk #2 succeeded at 2042 (offset -221 lines).
Hunk #3 succeeded at 2340 (offset -229 lines).
1 out of 3 hunks FAILED -- saving rejects to file drivers/cpufreq/intel_pstate.c.rej
==> ERROR: A failure occurred in prepare().
    Aborting...

$ cat $(locate intel_pstate.c.rej)
--- drivers/cpufreq/intel_pstate.c
+++ drivers/cpufreq/intel_pstate.c
@@ -297,6 +297,7 @@ static int hwp_active __read_mostly;
 static int hwp_mode_bdw __read_mostly;
 static bool per_cpu_limits __read_mostly;
 static bool hwp_boost __read_mostly;
+static bool global_no_turbo __read_mostly;
 
 static struct cpufreq_driver *intel_pstate_driver __read_mostly;

Comment 50 Srinivas Pandruvada 2018-07-06 05:36:10 UTC

What baseline are you using? I can make a patch for that.
Or you can just add this line manually to your partially patched intel_pstate.c:
static bool global_no_turbo __read_mostly;

Comment 51 carbonchauvinist 2018-07-06 12:01:28 UTC

I'm using the 4.17.3-1 baseline, my apologies if I misunderstood. Which did you patch? I can move to that instead.

Comment 52 Srinivas Pandruvada 2018-07-06 17:01:36 UTC

Created attachment 277233 [details]
disable_boot_turbo_option

ported to 4.17

Comment 53 carbonchauvinist 2018-07-06 17:03:47 UTC

Created attachment 277235 [details]
dmesg turbo_disable_patch 4.18.0-rc3 intel_pstate=no_turbo

Okay, got your patch applied to mainline (4.18.0-rc3) and just booted successfully (see attached dmesg from boot).

$ uname -a
Linux 5510 4.18.0-rc3-custom #1 SMP PREEMPT Fri Jul 6 11:56:18 EDT 2018 x86_64 GNU/Linux

$ cat /sys/devices/system/cpu/intel_pstate/no_turbo 
1

$ cat /proc/cmdline 
initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo dyndbg="file processor_perflib.c +p"

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver 
intel_pstate

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave

$ dmesg | grep -i turbo
[    0.000000] Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo dyndbg="file processor_perflib.c +p"
[    0.000000] intel_pstate: Turbo disabled by boot options
[    0.000000] Kernel command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo dyndbg="file processor_perflib.c +p"

Assuming all that looks correct and as expected, I will continue testing to make sure it boots without issue, and I'll run some increased loads to log with turbostat after enabling turbo manually (# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo) once booted.
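
For repeated testing, the sysfs toggle can be wrapped in two small helpers. This is only a sketch: the function names and the NO_TURBO override variable are made up for illustration, and the only thing taken from this report is the sysfs path and the meaning of its values (1 = turbo disabled, 0 = turbo enabled). Writing to the real file requires root.

```shell
# Path of the intel_pstate turbo switch; can be overridden for testing.
NO_TURBO=${NO_TURBO:-/sys/devices/system/cpu/intel_pstate/no_turbo}

# Report the current turbo state in words.
turbo_state() {
    case "$(cat "$NO_TURBO")" in
        0) echo "turbo enabled" ;;
        1) echo "turbo disabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Write 1 (disable turbo) or 0 (re-enable turbo); reject anything else.
set_no_turbo() {
    case "$1" in
        0|1) printf '%s\n' "$1" > "$NO_TURBO" ;;
        *) echo "usage: set_no_turbo 0|1" >&2; return 2 ;;
    esac
}
```

Calling turbo_state before and after set_no_turbo makes it easy to record in a log which state the machine was in when a freeze occurred.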

Comment 54 Srinivas Pandruvada 2018-07-06 19:35:13 UTC

So somehow the faster CPU frequencies are either causing a thermal issue or triggering a race condition in some code path.

Comment 55 carbonchauvinist 2018-07-06 20:54:25 UTC

Created attachment 277255 [details]
journalctl -k 4.18.0-rc3 intel_pstate=no_turbo freeze

journalctl -k 4.18.0-rc3 intel_pstate=no_turbo freeze

With the patch applied to 4.18.0-rc3 and booting with intel_pstate=no_turbo I can get to the desktop without issue.

I can run an increased load test (i.e. compile gcc5) just fine with turbo disabled.

However, when trying to run increased load test (compile gcc5) after manually re-enabling turbo (# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo) I've gotten a hard freeze every time.

In fact, just re-enabling turbo seems to be enough to cause a freeze - even without any increased load. 

I've attached the journal (kernel messages) from one of the boots where this happened for reference (as soon as I manually re-enabled turbo, I then pulled up my Plasma Application Dashboard and it froze before I could make a selection). 

Should I try the patch on 4.17.3-1 too?

Comment 56 carbonchauvinist 2018-07-06 21:27:09 UTC

Created attachment 277257 [details]
journalctl -k 4.17.3-1

4.17.3-1 kernel also freezes with applied patch and when manually enabling turbo after booting with intel_pstate=no_turbo parameter (see journalctl kernel messages attached).

If I don't enable turbo after boot, so far, I haven't had any freezes even under increased load.

Comment 57 carbonchauvinist 2018-07-06 21:32:10 UTC

Created attachment 277259 [details]
journalctl -k 4.17.3-1 patch intel_pstate=no_turbo - freeze when turbo enabled

*corrected attachment

4.17.3-1 kernel also freezes with applied patch and when manually enabling turbo after booting with intel_pstate=no_turbo parameter (see journalctl kernel messages attached).

If I don't enable turbo after boot, so far, I haven't had any freezes even under increased load.

Comment 58 Doug Smythies 2018-07-07 15:32:45 UTC

(In reply to carbonchauvinist from comment #55)

> However, when trying to run increased load test (compile gcc5) after
> manually re-enabling turbo (# echo 0 >
> /sys/devices/system/cpu/intel_pstate/no_turbo) I've gotten a hard freeze
> every time.
> 
> In fact, just re-enabling turbo seems to be enough to cause a freeze - even
> without any increased load. 

Interesting. How quickly does it freeze without any increased load? Myself, I still think there could be value added in trying older kernels.

From comment 37:

> Is there a better way to capture the turbostat output?
> I've been using "sudo turbostat --debug > turbostat.out 2>&1"
> but it seems to truncate (or not capture all) the output?

The output looks O.K. to me. Note that turbostat doesn't list some package specific stuff for CPUs > 0.

Myself, and for what we are looking into, I'd be tempted to run something like this:

sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgTmp,PkgWatt,GFXWatt,CorWatt --interval 1
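
Since a hard freeze can lose whatever has not yet reached the disk, one option is to push each line of output to disk as it arrives. A rough sketch, assuming turbostat emits its samples line by line when piped (not verified here); log_synced is a hypothetical helper, not part of turbostat:

```shell
# Append every line read from stdin to the file named in $1 and call
# sync after each line, so the file on disk is current up to the last
# line received before a freeze.
log_synced() {
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$1"
        sync
    done
}

# Example, reusing the turbostat options above:
#   sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 1 | log_synced turbostat.out
```

Calling sync once per line is heavy-handed, but at a one-second interval with one summary line per sample the cost should be small.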

Comment 59 Srinivas Pandruvada 2018-07-08 00:10:11 UTC

Try one more thing.
After booting with the no_turbo option, run turbostat with the options Doug suggested, but without the --quiet option:

$ sudo turbostat --show Busy%,Bzy_MHz,IRQ,PkgTmp,PkgWatt,GFXWatt,CorWatt --interval 1

Once you enable turbo you are seeing a freeze, so you probably can't run turbostat then. If you can, run it with some load.

Attach the output.

Then reboot, adding just one kernel parameter:
cpufreq.off=1

Read and write some MSRs. You will need the msr-tools package. Attach the output of all of these.

#rdmsr -a 0x770
#wrmsr -a 0x770 0x01
#rdmsr -a 0x771
#rdmsr -a 0x774
#wrmsr -a 0x774 0x80001a08

At this point, collect the output of all the commands issued so far, as the next command may cause a freeze.
#wrmsr -a 0x774 0x80002308

If it doesn't freeze, run some workload and run turbostat again with the above options.

If it still freezes, go back to some older kernel and see whether it freezes there. If it doesn't, we have to bisect.

Comment 60 carbonchauvinist 2018-07-08 15:24:55 UTC

(In reply to Doug Smythies from comment #58)
> Interesting. How quickly does it freeze without any increased load? 

It's not an every time thing unfortunately. The one time I can specifically recall (sorry, there have been a lot of boots and tests recently): I'd only pulled up my full screen Plasma Application Dashboard and it froze.

Also to further clarify, once enabling turbo after booting, I am sometimes able to run an increased load without freezes. There doesn't appear to be any rhyme or reason as to why sometimes it freezes and others it doesn't. The ratio seems to be approx. 7:3 (freezes:no-freezes).

In fact, one time while testing, after booting, I echoed "0" to enable turbo, then caused increased load (compile gcc5) and there was no freeze initially (it normally happens during the "Extracting gcc-5.5.0.tar.xz with bsdtar" step). I cancelled out of the compile, re-echoed "0" to /sys/devices/system/cpu/intel_pstate/no_turbo, and when I went to compile gcc5 again, it did freeze, again at the "extracting... with bsdtar" step.

Clear as mud, I'm sure.

Comment 61 carbonchauvinist 2018-07-08 15:36:55 UTC

Created attachment 277279 [details]
turbostat 4.17.3-1 intel_pstate=no_turbo (increased load included)

(In reply to Srinivas Pandruvada from comment #59)
> Try one more thing:
> After booting with no_turbo option
> 
> run a turbostat with options as Doug suggested. But remove --quiet option
> 
> $ sudo turbostat --show Busy%,Bzy_MHz,IRQ,PkgTmp,PkgWatt,GFXWatt,CorWatt
> --interval 1

ran turbostat with requested options and increased load during test

Comment 62 carbonchauvinist 2018-07-08 15:41:43 UTC

Created attachment 277281 [details]
turbostat 4.17.3-1 intel_pstate=no_turbo - enabled turbo (increased load included)

(In reply to Srinivas Pandruvada from comment #59)
> Try one more thing:
> After booting with no_turbo option
>...
> Once you enable turbo, you are seeing freeze so probably you can't run
> turbostat. If you can run with some load.
> 
> 
> Attach the output.

booted with no_turbo option, enabled turbo, increased load, this is one of the instances where it did not freeze.

Comment 63 carbonchauvinist 2018-07-08 15:56:01 UTC

Created attachment 277283 [details]
turbostat 4.17.3-1 intel_pstate=no_turbo cpufreq.off=1 custom MSR values with increased load

(In reply to Srinivas Pandruvada from comment #59)
> Then reboot by just adding just one kernel parameter
> cpufreq.off=1

Assuming I'm understanding you correctly, I'm booted with the following cmdline:

$ journalctl -b -1 | grep Command
Jul 08 10:33:41 archlinux kernel: Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo cpufreq.off=1 dyndbg="file processor_perflib.c +p"

> Read and write some MSRs. You will need the msr-tools package. Attach the output
> of all of these.
> 
> #rdmsr -a 0x770
> #wrmsr -a 0x770 0x01
> #rdmsr -a 0x771
> #rdmsr -a 0x774
> #wrmsr -a 0x774 0x80001a08
> 
> Collect at this point all the outputs of command issued as next command may
> cause freeze.
> #wrmsr -a 0x774 0x80002308

Again, assuming I did this correctly, here's the output of the above commands (I tried to re-read the addresses after writing to confirm they'd changed, but I may be misunderstanding what's happening here):

[root@5510 ghost]# rdmsr -a 0x770
[root@5510 ghost]# wrmsr -a 0x770 0x01
[root@5510 ghost]# 
[root@5510 ghost]# rdmsr -a 0x770
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# rdmsr -a 0x771
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# rdmsr -a 0x774
[root@5510 ghost]# wrmsr -a 0x774 0x80001a08
[root@5510 ghost]# 
[root@5510 ghost]# rdmsr -a 0x774
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# wrmsr -a 0x774 0x80002308
[root@5510 ghost]# 
[root@5510 ghost]# rdmsr -a 0x774
[root@5510 ghost]#

 
> If doesn't freeze then run some workload and again run turbostat with above
> options with some load.

I've attached the turbostat output of an increased workload for your review

Comment 64 Doug Smythies 2018-07-08 17:47:43 UTC

(In reply to carbonchauvinist from comment #63)

> > Read and write some MSRs. You will need the msr-tools package. Attach the output
> > of all of these.
> > 
> > #rdmsr -a 0x770
> > #wrmsr -a 0x770 0x01
> > #rdmsr -a 0x771
> > #rdmsr -a 0x774
> > #wrmsr -a 0x774 0x80001a08
> > 
> > Collect at this point all the outputs of command issued as next command may
> > cause freeze.
> > #wrmsr -a 0x774 0x80002308
> 
> Again, assuming I did this correctly, here's the output of the above
> commands (I tried to re-read the addresses after writing to confirm they'd
> changed, but I may be misunderstanding what's happening here):
> 
> [root@5510 ghost]# rdmsr -a 0x770
> [root@5510 ghost]# wrmsr -a 0x770 0x01
> [root@5510 ghost]# 
> [root@5510 ghost]# rdmsr -a 0x770
> [root@5510 ghost]# 
> [root@5510 ghost]# 
> [root@5510 ghost]# rdmsr -a 0x771
> [root@5510 ghost]# 
> [root@5510 ghost]# 
> [root@5510 ghost]# rdmsr -a 0x774
> [root@5510 ghost]# wrmsr -a 0x774 0x80001a08
> [root@5510 ghost]# 
> [root@5510 ghost]# rdmsr -a 0x774
> [root@5510 ghost]# 
> [root@5510 ghost]# 
> [root@5510 ghost]# 
> [root@5510 ghost]# wrmsr -a 0x774 0x80002308
> [root@5510 ghost]# 
> [root@5510 ghost]# rdmsr -a 0x774
> [root@5510 ghost]#
> 

Ya, something is wrong. It looks to me as though the msr module is not loaded (and note that the commands do not throw an error when this occurs). turbostat will load it, or you can load it manually:

sudo modprobe msr

Then you should get some results from rdmsr commands. Example (I do not have HWP, so use pstates as my example (idle system)):

$ sudo rdmsr --bitfield 15:8 -d -a 0x198
16
16
16
16
16
16
16
16

Comment 65 carbonchauvinist 2018-07-08 18:27:13 UTC

Created attachment 277291 [details]
*CORRECTED* turbostat 4.17.3-1 intel_pstate=no_turbo cpufreq.off=1 custom MSR values with increased load *CORRECTED*

(In reply to Srinivas Pandruvada from comment #59)
> Then reboot by just adding just one kernel parameter
> cpufreq.off=1


Booted with the following cmdline:

$ journalctl -b -1 | grep Command
Jul 08 10:33:41 archlinux kernel: Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo cpufreq.off=1 dyndbg="file processor_perflib.c +p"

> Read and write some MSRs. You will need the msr-tools package. Attach the output
> of all of these.
> 
> #rdmsr -a 0x770
> #wrmsr -a 0x770 0x01
> #rdmsr -a 0x771
> #rdmsr -a 0x774
> #wrmsr -a 0x774 0x80001a08
> 
> Collect at this point all the outputs of command issued as next command may
> cause freeze.
> #wrmsr -a 0x774 0x80002308


Thanks to Doug's help, here's the CORRECTED output of the above commands:

$ sudo su
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# rdmsr -a 0x770
0
0
0
0
[root@5510 ghost]# wrmsr -a 0x770 0x01
[root@5510 ghost]# rdmsr -a 0x770
1
1
1
1
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# rdmsr -a 0x771
1081a23
1081a23
1081a23
1081a23
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# rdmsr -a 0x774
8000ff01
8000ff01
8000ff01
8000ff01
[root@5510 ghost]# wrmsr -a 0x774 0x80001a08
[root@5510 ghost]# rdmsr -a 0x774
80001a08
80001a08
80001a08
80001a08
[root@5510 ghost]# 
[root@5510 ghost]# 
[root@5510 ghost]# wrmsr -a 0x774 0x80002308
[root@5510 ghost]# rdmsr -a 0x774
80002308
80002308
80002308
80002308
[root@5510 ghost]#

 
> If doesn't freeze then run some workload and again run turbostat with above
> options with some load.

I was able to run my usual increased load test (compile gcc5) and have attached the output here. I did get a hang once I completed the test and attempted to open a tab in the Falkon browser.
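
The values read back above can be unpacked into their fields. A minimal sketch, assuming the IA32_HWP_CAPABILITIES (0x771) and IA32_HWP_REQUEST (0x774) layouts from the Intel SDM, where each of the low four bytes is one 8-bit field; the function names are made up for illustration:

```shell
# Print the 8-bit field that starts at bit $2 of the hex MSR value $1.
# Accepts rdmsr-style output with or without a leading 0x.
msr_field() {
    echo $(( (0x${1#0x} >> $2) & 0xff ))
}

# IA32_HWP_CAPABILITIES (0x771): highest, guaranteed, most efficient
# and lowest performance levels, from the low byte upwards.
decode_hwp_capabilities() {
    printf 'highest=%d guaranteed=%d most_efficient=%d lowest=%d\n' \
        "$(msr_field "$1" 0)" "$(msr_field "$1" 8)" \
        "$(msr_field "$1" 16)" "$(msr_field "$1" 24)"
}

# IA32_HWP_REQUEST (0x774): minimum, maximum and desired performance,
# then the energy/performance preference (EPP), from the low byte upwards.
decode_hwp_request() {
    printf 'min=%d max=%d desired=%d epp=%d\n' \
        "$(msr_field "$1" 0)" "$(msr_field "$1" 8)" \
        "$(msr_field "$1" 16)" "$(msr_field "$1" 24)"
}
```

Read this way, 1081a23 gives highest=35 and guaranteed=26, and the two writes to 0x774 raise the maximum from 26 (80001a08, the guaranteed level) to 35 (80002308, the highest level). If one performance level corresponds to 100 MHz, which is an assumption not stated in this thread, that is the step from 2.6 GHz to 3.5 GHz, i.e. from non-turbo into the turbo range.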

Comment 66 Srinivas Pandruvada 2018-07-08 19:18:11 UTC

So you did get a freeze even without using intel_pstate? Basically, taking the platform into turbo causes some issue in another part of the code. Maybe it is just a UI hang? If you ssh into this platform from another computer, it may still be responding.

Just boot Ubuntu or Fedora from a USB disk and check whether you see a freeze. Fedora usually migrates to a pretty recent kernel.

Comment 67 carbonchauvinist 2018-07-08 20:30:18 UTC

(In reply to Srinivas Pandruvada from comment #66)
> So you did get a freeze even without using intel_pstate? Basically taking
> platform to take turbo causes some issue in other part of the code. May be
> it is just UI hang? If you ssh into this platform from other computer it may
> be still responding.
> 
> Just boot ubuntu or Fedora from a USB disk and check if you see freeze.
> Fedora usually migrates to pretty latest kernel.

So I just retested five times in a row still on the 4.17.3-1-custom built kernel (boot with intel_pstate=no_turbo cpufreq.off=1, load msr module, write msr values as instructed, increase load by compiling gcc5) and got a hard freeze each time.

In each instance I can't ssh in (I confirmed I could ssh in normally prior to each hang for each of the five boots I tested), and I also can't switch ttys. In some instances I get a flashing caps lock key, in others I don't. I have to hold the power button down for 3-5 seconds to hard power off the machine.

I'll boot from latest fedora and/or ubuntu and report back shortly.

Comment 68 carbonchauvinist 2018-07-08 21:32:47 UTC

Can confirm that booting with latest fedora live usb produces freezes on my machine (four boots and four freezes in a row). 

Though I've been able to make it through the boot process in each one (the freezes happen after logging into desktop). 

I was able to run some simple stress tests ( stress -c 4 ) for some time during one of the boots without a hang, but then hung later while just changing directory in terminal:

$ uname -a
Linux localhost-live 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

intel_pstate was scaling_driver, no_turbo was set to "0", powersave was the scaling_governor.

I'll try ubuntu also.

Comment 69 carbonchauvinist 2018-07-08 22:29:24 UTC

Same with Ubuntu's latest live iso (18.04 LTS), had a couple of freezes during the boot process, also had one when I was able to get booted just trying to cat some files in the terminal.

$ uname -a
Linux ubuntu 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 70 Srinivas Pandruvada 2018-07-09 02:56:54 UTC

This issue is happening irrespective of the intel_pstate driver. Enabling turbo causes the issue on this platform. Even if HWP is disabled (which these MSRs are enabling), you still see the issue. Also, this is happening on kernels as old as 4.15.

You are seeing too many BUG messages for some wireless driver; can you fix that noise? Maybe rmmod the module and see if it has anything to do with this.

Since we can't get logs after the issue occurs, it is difficult to see where it was stuck.
Try enabling CONFIG_MAGIC_SYSRQ. Check here and see if you can make the kernel respond after the freeze.
https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
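
Even with CONFIG_MAGIC_SYSRQ built in, the runtime value in /proc/sys/kernel/sysrq decides which SysRq functions are accepted: 0 disables them, 1 enables all of them, and larger values are a bitmask (described in the document linked above). A small sketch that checks whether a given value permits keyboard control, sync, remount read-only, process signalling and reboot; the helper name, and the choice to check these five together, are illustrative rather than something from this thread:

```shell
# Return 0 if sysrq value $1 allows keyboard control (4), sync (16),
# remount read-only (32), signalling processes (64) and reboot (128).
sysrq_allows_recovery() {
    [ "$1" -eq 1 ] && return 0      # 1 means every function is enabled
    [ $(( $1 & 244 )) -eq 244 ]     # 244 = 4 + 16 + 32 + 64 + 128
}

# Example (enables all functions until the next reboot if the check fails):
#   sysrq_allows_recovery "$(cat /proc/sys/kernel/sysrq)" || echo 1 | sudo tee /proc/sys/kernel/sysrq
```

A value of 16, for instance, allows only the sync function, so a full recovery key sequence would be mostly ignored.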

Comment 71 Doug Smythies 2018-07-09 14:45:48 UTC

(In reply to Srinivas Pandruvada from comment #70)
> This issue is happening of irrespective of intel_pstate driver. Enabling
> turbo is getting issue on this platform.

I agree that the root issue here appears to have nothing to do with the intel_pstate CPU frequency scaling driver, and everything to do with turbo enabled or disabled, data from attachment 277141 [details] (which maybe didn't run long enough) notwithstanding.

Comment 72 Srinivas Pandruvada 2018-07-09 16:31:47 UTC

Two issues to note here:
- This laptop was reportedly sold with Ubuntu Linux, but I haven't seen any reports from Canonical or others.
- Some version used to boot on this machine; can we try that version again? This will help determine whether some HW issue has appeared on this platform lately. Also, if we have to bisect, we need a known good version. The magic SysRq key, if it works, may help in debugging.

Comment 73 carbonchauvinist 2018-07-10 13:42:28 UTC

(In reply to Srinivas Pandruvada from comment #70)
> This issue is happening of irrespective of intel_pstate driver. Enabling
> turbo is getting issue on this platform. Even if HWP is disabled (which
> these MSRs are enabling), you still see the issue. Also this is happening
> for kernel as old as 4.15 kernel.

I see, thanks for your time and efforts working through to verify this; at least we know where the problem isn't right now - which is more than we knew before.

> You are seeing too many BUG for some wireless driver, can you fix that
> noise? May be rmmod the module and see if this has anything to do.

I think these bugs are related to a recent switch to iwd from wpa_supplicant and should not be related to my freezing issues as the freezing occurred prior to installing iwd and continued after installing iwd. I can certainly switch off iwd and go back to wpa_supplicant if you think it's worthwhile. 

> Since we can't get logs after issue occurs, it is difficult to see where it
> was stuck.
> Try enabling CONFIG_MAGIC_SYSRQ. Check here and see if you can make the
> kernel respond after the freeze.
> https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html

I will work on implementing this here soon and report back with any findings.

(In reply to Srinivas Pandruvada from comment #72)
> Two issues to note here:
> - This laptop was sold with Ubuntu Linux as claimed. But I haven't seen any
> reports from Canonical or others.

I know this particular build of the Precision 5510 occupied a gray area as most who would buy this machine would want an i7 (or at least an i5 with HT) along with a dedicated graphics card as it's marketed as a Workstation class laptop. I bought this specifically because I didn't want HT (hence the i5-6440HQ) and I didn't want an nvidia dedicated card to struggle with when using Linux. There may not have been that many of this particular build sold, and maybe even less where they're running Linux is my only guess.

I have seen a small sample of anecdotal reports of freezing/hanging when searching across various fora, but it's hard to get a gauge on how prevalent it is or if it's related to this turbo issue.

I do know there have been a lot of posts about the BIOS on this laptop being really buggy, and most have reached the conclusion that Dell has decided not to dedicate any resources to addressing it, which is unfortunate to say the least.

> - Some version used to boot on this machine, can we try that version again?
> This will help if there is some HW issue on this platform lately. Also if we
> have to bisect we need to have some good version. Magic sys key request if
> works, then may help in debug.

I can certainly try to do this, but this will be a massive undertaking. I can access older versions of the kernel using Arch Archive, but downgrading to say kernel 4.1x and all the other package downgrades that will need to happen along with that is more than a notion. I'll see what this entails though. 

I may be able to parse the journal to see if I can find clues as to when this started happening and then use that to base which version of the kernel to try.

Comment 74 Doug Smythies 2018-07-10 13:55:37 UTC

(In reply to carbonchauvinist from comment #73)

> I can certainly try to do this, but this will be a massive undertaking. I
> can access older versions of the kernel using Arch Archive, but downgrading
> to say kernel 4.1x and all the other package downgrades that will need to
> happen along with that is more than a notion. I'll see what this entails
> though. 

You shouldn't have to downgrade any other packages, just try the other kernels. I have kernels from 4.4 through 4.18-rc4 on my test computer right now. Finding a good starting point kernel for a bisection is not hard. The bisection itself is not hard, but very tedious and time consuming.
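
To put a rough number on the effort: a bisection over N commits needs about log2(N) build-and-boot rounds. A small sketch of that arithmetic; the function name is illustrative, and the commit count in the example is an arbitrary placeholder, not a figure from this report:

```shell
# Print the smallest k such that 2^k >= $1, i.e. roughly how many
# kernels have to be built and booted to bisect a range of $1 commits.
bisect_steps() {
    n=$1 steps=0 span=1
    while [ "$span" -lt "$n" ]; do
        span=$(( span * 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Example: bisect_steps 13000   (13000 is a made-up commit count)
```

With an intermittent failure, each round would also need several boots before a kernel can be called good, which is probably where most of the time would go.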

Comment 75 carbonchauvinist 2018-07-10 18:55:38 UTC

(In reply to Doug Smythies from comment #71)
> (In reply to Srinivas Pandruvada from comment #70)
> > This issue is happening of irrespective of intel_pstate driver. Enabling
> > turbo is getting issue on this platform.
> 
> I agree that the root issue here appears to have nothing to the intel_pstate
> CPU frequency scaling driver, and everything to do with turbo enabled or
> disabled, data from attachment 277141 [details] (which maybe didn't run long
> enough) not withstanding.

My confusion is, because the issue is with Turbo, and I most certainly was hitting Turbo ranges in the past regardless of the kernel version I was on, why would it have only manifested (semi-)recently?

Might this have been introduced with a BIOS update? Is there any chance this issue is related to the buggy BIOS for this machine?

(In reply to Doug Smythies from comment #74)
> You shouldn't have to downgrade any other packages, just try the other
> kernels. I have kernels from 4.4 through 4.18-rc4 on my test computer right
> now. Finding a good starting point kernel for a bisection is not hard. The
> bisection itself is not hard, but very tedious and time consuming.

Thanks again for your time and assistance Doug. I will try to see if I can find anything in the journal that clues me into a good kernel version to start testing from and then will proceed as best I can.

Essentially, I'll just be looking for the kernel version that allows me to boot regularly (i.e. without intel_pstate=disable) successfully?

Comment 76 Doug Smythies 2018-07-11 21:40:22 UTC

(In reply to carbonchauvinist from comment #75)

> My confusion is, because the issue is with Turbo, and I most certainly was
> hitting Turbo ranges in the past regardless of the kernel version I was on,
> why would it have only manifested (semi-)recently?

Well, that is what we're trying to figure out.

> Might this have been introduced with a BIOS update?

I don't know.

> Is there any chance this
> issue is related to the buggy BIOS for this machine?

I don't know.

> Essentially, I'll just be looking for the kernel version that allows me to
> boot regularly (i.e. without intel_pstate=disable) successfully?

Whatever is your most reliable / probable indicator of the issue.

If you want to continue with a kernel bisection, that could help. Often (it seems), once one finishes a bisection and then searches for the resulting commit, one finds information about the issue (in my own experience, about 3/4 of the time). Before the bisection, the exact search terms are often not known.

Another option is for you to simply disable turbo in your BIOS, and move on, checking new kernels every few months to see if the issue persists. i.e. leave it to somebody else to solve.

Comment 77 carbonchauvinist 2018-07-11 22:22:49 UTC

Created attachment 277327 [details]
4.8.13-1 freeze on boot

Well I went all the way back to kernel 4.8.13-1 which is from around the time I first got the machine and I'm still having freezes on boot and once booted.

I went through the steps of enabling the magic SysRq key. The last couple of times I got a freeze I tried the REISUB key sequence and got some output. I'm not sure whether it's of use; I couldn't transcribe or otherwise grab it, so I took a picture.

Comment 78 Srinivas Pandruvada 2018-07-12 14:18:54 UTC

So something went wrong in the system: you are hitting the issue even on an old kernel.
From the attachment it seems that there was a machine check error. These are errors the CPU raises when there is a hardware error (some recoverable, some not). This type of freeze will happen if an unrecoverable error occurred.

I am not an expert on machine checks, so there may be a code identifying the actual error.

Comment 79 carbonchauvinist 2018-08-15 13:36:35 UTC

I was never able to recreate the machine check error I received when booting kernel 4.8.13-1; it happened only once and did not happen again afterwards. 

I tried numerous other kernel versions moving up from 4.8.13-1 and still got the system freezes, but no more of the machine check errors.

I've taken Doug's advice and have simply turned off turbo in my BIOS - which has worked well enough for me at this point (though leaves a really bad taste in my mouth considering the amount of money spent on this laptop).

There was a new BIOS released for this laptop recently (1.8.0 released 8/14/18) which I applied hoping this would help address my issue. 

However, after applying the new BIOS successfully and turning Turbo back on in the BIOS, I booted into the current stable Arch kernel (4.17.14.arch1-1), and when running an increased load test (compiling gcc5) I got a hang almost immediately upon starting the compile.

I'll try reaching out to Dell Sputnik support to see if I get any recourse there. 

Otherwise, I'll continue to grit my teeth and run with turbo disabled in the BIOS.

Comment 80 Chen Yu 2019-09-09 09:10:06 UTC

@carbonchauvinist@protonmail.ch
Do you still see this issue using the latest upstream kernel after one year?
If yes, I'd suggest running a benchmark with turbo enabled, with only one CPU online (maxcpus=1), and with graphics disabled (modprobe.blacklist=i915). Or, to be even more aggressive, you can rebuild the kernel with CONFIG_DRM not set.
Then run stress-ng to check if the issue is still there.

Comment 81 carbonchauvinist 2019-11-28 05:50:19 UTC

@Chen Yu
(In reply to Chen Yu from comment #80)
> Do you still see this issue using the latest upstream kernel after one year?

Thank you for your interest. Yes, unfortunately the issue is still present even now, running the latest 5.3.13-arch1-1 kernel.

At this point I'm unable to boot at all if I have turbo enabled (it hangs on the vendor logo, with no flicker to indicate that it ever moved out of the initrd or that early KMS has occurred).

If I disable pstate (intel_pstate=disable) I can boot with turbo enabled. When I do, once I'm logged into my DE it appears to operate normally. However, I still experience complete freezes intermittently, especially when coming back from a suspend.

I just worked around all this by disabling turbo in the UEFI.

> If yes, I'd suggested to test benchmark with turbo enabled, and with only
> CPU online(maxcpus=1),   and with graphic disabled(modprobe.blacklist=i915).

Booting with maxcpus=1 and modprobe.blacklist=i915, I was able to switch to a TTY, log in successfully and run stress-ng for 90 mins straight without issue! Are there specific flags you want me to run stress-ng with? If I change maxcpus to any number other than 1, I am unable to boot again.

Even when booting with modprobe.blacklist=i915 it does appear that i915 was loaded after all though? It shows up in lsmod when run in the same TTY I ran the stress-ng test in. Is the blacklist parameter supposed to stop it from being loaded in the initrd only? Or am I missing something?

Also, when booting with maxcpus=1 and modprobe.blacklist=i915 I'm unable to log in using my DM (SDDM) as after entering the password I get a hang (though it's not a hard hang as I can press the power button and the system shuts down. I can't however switch to another TTY when this happens). I'm not sure if that's due to any of the parameters being passed - maybe you expect that to happen?

> Or even to be more aggressive, you can rebuild the kernel with CONFIG_DRM
> not set.
> Then running stress-ng to check if the issue is still there.

Do you still this is a valuable test based on the information above? I'll try to give this a go and report back.

Comment 82 carbonchauvinist 2019-11-28 06:03:45 UTC

*Do you still think this is a valuable test*

Comment 83 carbonchauvinist 2019-11-28 16:17:30 UTC

@Chen Yu

Okay, so I've been able to test a little more. Here's more complete, and hopefully more useful, information:

1. When booting with turbo enabled, maxcpus=1, modprobe.blacklist=i915 and NOT disabling pstate driver -- I can successfully switch to a TTY, log in, and run stress-ng for long periods of time without issue (at least 90 mins).

I am unable to log into a session via SDDM though as it hangs -- doesn't appear to be a full panic or anything as I'm able to power off the computer by simply pressing the power button and I can see the shutdown happen. When the session is hung after trying to log on via SDDM, I am unable to switch to other TTYs.

2. When booting with turbo enabled, maxcpus=2 through maxcpus=3, modprobe.blacklist=i915, and NOT disabling pstate driver -- I can successfully switch to a TTY, log in, and run stress-ng for long periods of time without issue. (I've run for an hour without issue with maxcpus=2 and maxcpus=3 respectively.)

With maxcpus=2 through maxcpus=3 I am able to successfully log into a session via SDDM and I was able to suspend and resume without a hang/freeze (though I only tested the suspend cycle once with maxcpus=3).

3. When booting with turbo enabled, maxcpus=4, modprobe.blacklist=i915, and NOT disabling pstate driver -- I am unable to boot at all. I can't even get past the vendor logo as it hangs here. I don't even see a flicker, just the logo and then it immediately freezes.

Comment 84 Chen Yu 2020-07-04 04:59:39 UTC

(In reply to carbonchauvinist from comment #83)
> @Chen Yu
> 
> Okay, so I've been able to test a little more. Here's more complete, and
> hopefully more useful, information:
> 
> 1. When booting with turbo enabled, maxcpus=1, modprobe.blacklist=i915 and
> NOT disabling pstate driver -- I can successfully switch to a TTY log in and
> run stress-ng for long periods of time without issue (at least 90 mins).
> 
> I am unable to log into a session via SDDM though as it hangs -- doesn't
> appear to be a full panic or anything as I'm able to power off the computer
> by simply pressing the power button and I can see the shutdown happen. When
> the session is hung after trying to log on via SDDM, I am unable to switch
> to other TTYs.
> 
> 2. When booting with turbo enabled, maxcpus=2 through maxcpus=3,
> modprobe.blacklist=i915, and NOT disabling pstate driver -- I can
> successfully switch to a TTY log in and run stress-ng for long periods of
> time without issue. (I've run for an hour without issue with maxcpus=2 and
> maxcpus=3 respectively.)
> 
> With maxcpus=2 through maxcpus=3 I am able to successfully log into a
> session via SDDM and I was able to suspend and resume without a hang/freeze
> (though I only tested the suspend cycle once with maxcpus=3).
> 
> 3. When booting with turbo enabled, maxcpus=4, modprobe.blacklist=i915, and
> NOT disabling pstate driver -- I am unable to boot at all. I can't even get
> past the vendor logo as it hangs here. I don't even see a flicker, just the
> logo and then it immediately freezes.

Thanks very much for the detailed information. This appears to be a conflict between graphics and high CPU frequency. Could you boot with intel_pstate=disable and turbo enabled in the BIOS, and, once the system has booted, enable turbo manually as follows?

Read MSR 0x1a0 on all CPUs:
# rdmsr -a 0x1a0
850089
Check whether bit 38 (turbo disabled) is set:
# rdmsr 0x1a0 -f 38:38
0
If bit 38 is 1, turbo is disabled (it should be the case in your case). Then enable turbo by clearing bit 38:
# wrmsr -a 0x1a0 xxx
(where xxx is the value you read via rdmsr with bit 38 cleared)
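
As a hedged illustration of the bit arithmetic described above, here is a minimal sketch that checks bit 38 of a value read from MSR 0x1a0 and computes the value with that bit cleared. The function names are hypothetical, and the sketch only does the arithmetic; it does not call rdmsr or wrmsr itself.

```shell
#!/bin/sh
# Bit 38 of MSR 0x1a0 is the "turbo disabled" bit referred to above.
TURBO_DISABLE_MASK=$((1 << 38))

# Print 1 if bit 38 is set in the given value, otherwise 0.
turbo_disabled() {
    echo $(( ($1 & $TURBO_DISABLE_MASK) != 0 ))
}

# Print the given value with bit 38 cleared, in hex.
clear_turbo_disable() {
    printf '0x%x\n' $(( $1 & ~$TURBO_DISABLE_MASK ))
}

# The value reported in this thread: bit 38 is not set.
turbo_disabled 0x850089

# An assumed example value with bit 38 set, and the same value with it cleared.
turbo_disabled 0x4000850089
clear_turbo_disable 0x4000850089
```

The output of clear_turbo_disable is the kind of value that would replace xxx in the wrmsr command, assuming the rest of the register should stay unchanged.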

Comment 85 carbonchauvinist 2020-07-13 22:41:53 UTC

(In reply to Chen Yu from comment #84)
> Thanks very much for the information in detail. This appears to be a
> conflict between the graphic and high frequency. Could you boot with
> intel_pstate=disabled with turbo enabled in BIOS, and after boot into the
> system, enable the turbo manually by:
> 
> #  rdmsr -a 0x1a0
> #  850089
> check if bit38(turbo disabled) is set
> # rdmsr 0x1a0 -f 38:38
> # 0
> if the bit38 is 1, it means the turbo is disabled(it should be the case in
> your case), then enable the turbo by setting bit38 to 0 using 
> wrmsr -a 0x1a0 xxx(xxx should be the value you read via rdmsr with bit38
> cleared)

Thank you. Just to be clear that I understand correctly, you are recommending the following?
1. Enable turbo in BIOS
2. Boot with intel_pstate=disable

When doing so here is the msr value as read:

$ sudo rdmsr -a -x -0 0x1a0
0000000000850089
0000000000850089
0000000000850089
0000000000850089

Why do you think that turbo should be disabled in my case after steps 1 and 2 above? Are you assuming that I'm also disabling turbo via a kernel line boot parameter? Should I be?

Here's the current cmdline I'm passing

$ cat /proc/cmdline 
initrd=\intel-ucode.img initrd=\initramfs-dracut-linux.img root=UUID=4d59661c-0a7e-41ce-8d7b-6cdb367f6e91 rw sysrq_always_enabled=1 quiet intremap=off intel_pstate=disable

And here's the stats when booted showing turbo is in fact enabled as I would expect (unless I'm missing something):

$ type pstate && pstate
pstate is aliased to `grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_{driver,governor} /sys/devices/system/cpu/intel_pstate/no_turbo /sys/devices/system/cpu/cpufreq/boost'
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:schedutil
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:schedutil
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor:schedutil
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:schedutil
grep: /sys/devices/system/cpu/intel_pstate/no_turbo: No such file or directory
/sys/devices/system/cpu/cpufreq/boost:1
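
For readers interpreting the output above: with acpi-cpufreq the intel_pstate/no_turbo file does not exist, and cpufreq/boost reading 1 means turbo is allowed. A minimal sketch of that check follows; the function name and the optional root-prefix argument are hypothetical and exist only so the logic can be tried against a fake tree.

```shell
#!/bin/sh
# Report turbo state from sysfs.
# $1 is an optional root prefix (hypothetical, for testing); default is the real /sys.
turbo_state() {
    base="${1:-}/sys/devices/system/cpu"
    if [ -r "$base/intel_pstate/no_turbo" ]; then
        # intel_pstate driver: no_turbo=0 means turbo is allowed
        if [ "$(cat "$base/intel_pstate/no_turbo")" = "0" ]; then
            echo enabled
        else
            echo disabled
        fi
    elif [ -r "$base/cpufreq/boost" ]; then
        # acpi-cpufreq driver: boost=1 means turbo is allowed
        if [ "$(cat "$base/cpufreq/boost")" = "1" ]; then
            echo enabled
        else
            echo disabled
        fi
    else
        echo unknown
    fi
}

turbo_state
```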

As a possible aside, I have since moved to dracut initramfs -- while I was using mkinitcpio when originally reporting the issue.

Also please note in the past I was able to boot successfully when disabling pstate, but I would get hangs at times when trying to resume from suspends. According to Doug and Srinivas it was most likely a thermal event during those hangs.

Comment 86 Chen Yu 2020-07-14 00:24:27 UTC

(In reply to carbonchauvinist from comment #85)
> (In reply to Chen Yu from comment #84)
> > Thanks very much for the information in detail. This appears to be a
> > conflict between the graphic and high frequency. Could you boot with
> > intel_pstate=disabled with turbo enabled in BIOS, and after boot into the
> > system, enable the turbo manually by:
> > 
> > #  rdmsr -a 0x1a0
> > #  850089
> > check if bit38(turbo disabled) is set
> > # rdmsr 0x1a0 -f 38:38
> > # 0
> > if the bit38 is 1, it means the turbo is disabled(it should be the case in
> > your case), then enable the turbo by setting bit38 to 0 using 
> > wrmsr -a 0x1a0 xxx(xxx should be the value you read via rdmsr with bit38
> > cleared)
> 
> Thank you. Just to be clear and I understand, you are recommending the
> following?
> 1. Enable turbo in BIOS
> 2. Boot with intel_pstate=disable
> 
> When doing so here is the msr value as read:
> 
> $ sudo rdmsr -a -x -0 0x1a0
> 0000000000850089
> 0000000000850089
> 0000000000850089
> 0000000000850089
> 
> Why do you think that turbo should be disabled in my case after steps 1 and
> 2 above? Are you assuming that I'm also disabling turbo via a kernel line
> boot parameter? Should I be?
> 
It was a typo. I intended to check the impact of turbo, but after reading the thread again, it seems that we have already tested that. So I'm trying to figure out what the result would be if we disable graphics. As you mentioned, i915 was loaded anyway. Could you check what the dependencies of i915 are after the system has booted up? For example, lsmod should show some dependencies for i915, and we might need to keep those from loading via the boot command line as well, for example:
modprobe.blacklist=i915,cec,video,i2c_algo_bit,drm_kms_helper,drm
If i915 is still loaded, the last resort would be to disable CONFIG_DRM in the kernel config and recompile the kernel.
BTW, besides the i915 driver, do you have any other graphics driver loaded in your system? We should unload those as well.

Comment 87 carbonchauvinist 2020-07-14 00:55:04 UTC

Thanks for the clarification. I'm willing to try completely disabling graphics.

$ lsmod | rg i915
i915                 2617344  23
intel_gtt              24576  1 i915
i2c_algo_bit           16384  1 i915
drm_kms_helper        253952  1 i915
cec                    69632  2 drm_kms_helper,i915
drm                   581632  8 drm_kms_helper,i915

So I can try disabling {i915,intel_gtt,i2c_algo_bit,drm_kms_helper,cec,drm} IIUC.
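
A rough sketch of how that list could be derived mechanically from lsmod-style output: take the target module plus every module whose "Used by" column names it, and join them into a modprobe.blacklist= parameter. The function name is hypothetical, and the sample input is the output shown above. Whether the parameter alone actually keeps the modules from loading is a separate question.

```shell
#!/bin/sh
# Read lsmod-style lines on stdin and print a modprobe.blacklist= parameter
# containing the target module ($1) and every module listing it under "Used by".
blacklist_param() {
    awk -v target="$1" '
        $1 == target { next }                 # the target itself is emitted first, in END
        {
            n = split($4, users, ",")         # 4th column: comma-separated users
            for (i = 1; i <= n; i++)
                if (users[i] == target) { list = list "," $1; break }
        }
        END { print "modprobe.blacklist=" target list }
    '
}

# Sample input copied from the lsmod output in this comment.
blacklist_param i915 <<'EOF'
i915                 2617344  23
intel_gtt              24576  1 i915
i2c_algo_bit           16384  1 i915
drm_kms_helper        253952  1 i915
cec                    69632  2 drm_kms_helper,i915
drm                   581632  8 drm_kms_helper,i915
EOF
```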

I'm only using the i915 driver; this is an onboard-video-only system with no other graphics.

An additional data point, I booted in with intel_pstate disabled just now and was able to use the system for some time (~2 hrs) including a suspend and resume cycle and multiple short runs of stress-ng (~1 min duration x 4 or 5 times).

However, not too long after the resume (~5 mins), I got a freeze while browsing a random website.

Booting back up after that freeze, it took two attempts for gdm to load (the first boot froze before gdm loaded). When gdm did load on the second boot it froze immediately. According to the journal it was:

Jul 13 20:27:55 lap kernel: #PF: supervisor read access in kernel mode
Jul 13 20:27:55 lap kernel: BUG: kernel NULL pointer dereference, address: 0000000000000070

I'll attach that journal here next in case it's helpful.

Comment 88 carbonchauvinist 2020-07-14 00:58:46 UTC

Created attachment 290259 [details]
intel_pstate=disable kernel 5.7.8-arch1-1 kernel null pointer dereference #pf supervisor read access

Comment 89 Chen Yu 2020-07-14 02:13:53 UTC

(In reply to carbonchauvinist from comment #87)
> Thanks for clarification. I'm willing to try to completely disabling graphic.
> 
> $ lsmod | rg i915
> i915                 2617344  23
> intel_gtt              24576  1 i915
> i2c_algo_bit           16384  1 i915
> drm_kms_helper        253952  1 i915
> cec                    69632  2 drm_kms_helper,i915
> drm                   581632  8 drm_kms_helper,i915
> 
> So I can try disabling {i915,intel_gtt,i2c_algo_bit,drm_kms_helper,cec,drm}
> IIUC.
> 
> I'm only using i915 driver, this is an onboard video only system no other
> graphics.
> 
> An additional data point, I booted in with intel_pstate disabled just now
> and was able to use the system for some time (~2 hrs) including a suspend
> and resume cycle and multiple short runs of stress-ng (~1 min duration x 4
> or 5 times).
> 
Please check whether boot succeeds with intel_pstate enabled (turbo enabled) + graphics disabled, to first narrow down whether there is a conflict with graphics.
> However, not too long after the resume (~5 mins) when browsing a random
> website received a freeze. 
The suspend/resume issue might be a separate problem; let's stick to the boot issue first.
> 
> Booting back up after that freeze, took me twice for gdm to load (first time
> froze before gdm loaded). When gdm did load on the second boot it froze
> immediately - according to journal it was:
> 
> Jul 13 20:27:55 lap kernel: #PF: supervisor read access in kernel mode
> Jul 13 20:27:55 lap kernel: BUG: kernel NULL pointer dereference, address:
> 0000000000000070
> 
> I'll attach that journal here next in case it's helpful.
This seems to be another boot issue. BTW, is it possible to use the latest upstream kernel (compiled from source rather than using the Arch Linux image directly)?

Comment 90 carbonchauvinist 2020-07-14 02:32:26 UTC

(In reply to Chen Yu from comment #89)
> Please help check if boot succeed with intel_pstate enabled(turbo enabled) +
> graphic disabled, to first narrow down if there's graphic conflict.

I will try this and report back. 

> The suspend resume issue might be another problem, let's stick to the boot
> issue first.

I agree.

> BTW, is it possible to use the latest
> upstream kernel (by compiling from source code rather than using the arch
> linux image directly?)

Yes I can try to do this.

Comment 91 carbonchauvinist 2020-07-15 13:32:39 UTC

Created attachment 290291 [details]
intel_pstate enabled; turbo enabled; graphic disabled; still kernel panic

(In reply to Chen Yu from comment #89)
> Please help check if boot succeed with intel_pstate enabled(turbo enabled) +
> graphic disabled, to first narrow down if there's graphic conflict.

I was able to completely prevent the i915 module from loading. To do so I needed more than just the kernel command line parameter `modprobe.blacklist=i915,cec....`

I also had to add the following to /etc/modprobe.d/modprobe.conf and make sure it was included in my dracut initramfs:
install i915 /bin/true
install intel_gtt /bin/true

Trying to simply blacklist by using
blacklist i915
blacklist intel_gtt
in the modprobe.conf file did NOT stop the module from loading.
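
For context, per modprobe.d(5) a `blacklist` entry only causes a module's internal aliases to be ignored, whereas an `install` entry replaces the load with the given command, which may explain the difference observed here. A small sketch that generates such `install` overrides for a list of modules follows; the function name is hypothetical, and writing the result to a file under /etc/modprobe.d and rebuilding the initramfs is left to the reader.

```shell
#!/bin/sh
# Print one "install <module> /bin/true" line per module name given.
# With such a line, modprobe runs /bin/true instead of loading the module.
modprobe_overrides() {
    for mod in "$@"; do
        printf 'install %s /bin/true\n' "$mod"
    done
}

# The two modules overridden in this comment.
modprobe_overrides i915 intel_gtt
```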

Even with the i915 module NOT loaded and intel_pstate enabled, I would only successfully get to a login prompt maybe 1 out of 10 times. The other 9 out of 10 times (approx.) it would freeze either somewhere in the initramfs, before it got to the initramfs (i.e. ACPI errors were shown and then a freeze), or immediately at the Dell vendor logo (i.e. no ACPI errors were printed to screen, just the logo and an immediate freeze).

If I did get the login prompt and logged in, I would still get freezes, though the timing varied from just a few seconds to up to 10 minutes before a freeze.

This picture captures one of those freezes (sorry, there was no other way to capture the output of the freeze). As you can see, i915 is not loaded (`lsmod | rg i915` returns nothing). I then cat the modprobe.conf file to show how I blacklisted the modules. Then just trying to run a random command (in this case `fwupdmgr get-updates`) caused a kernel panic after spitting out some ACPI errors.

This is with the Arch kernel though. Do you still think it is useful to try the latest upstream kernel as well? (Do you mean the latest STABLE upstream, i.e. 5.7.8, or the latest upstream mainline, i.e. 5.8-rc5? And would this just be the default config when compiling?)

Thanks again for your time on this. It's been really frustrating having to hobble this machine by disabling turbo. With turbo disabled in the BIOS, though, the machine runs like a dream with no hangs, freezes, or panics.

Comment 92 Doug Smythies 2020-07-20 20:46:11 UTC

I think he means kernel 5.8-rc5 (well, -rc6 now).

Use the kernel configuration from your 5.7.8-arch1-1 kernel, and just accept the defaults from there, if required.

Since I thought we had eliminated the intel_pstate CPU frequency scaling driver as an issue for this bug report, I don't really pay attention anymore. While I cannot speak for Srinivas, that might be the case for him also.

Comment 93 Zhang Rui 2021-04-23 05:54:57 UTC

This is a really long thread, and it has been a long time since the last update.

I just want to confirm whether my understanding of this problem is correct:
1. The problem cannot be reproduced with Turbo disabled in the BIOS.
2. The problem cannot be reproduced with Turbo disabled via the kernel command line, using Srinivas' patch.
3. The problem cannot be reproduced with the powersave governor + passive mode.

All of this suggests that there is a race/thermal issue during boot that causes the failure.

I'm not sure how likely a race is, because if it were one, it should be reproducible on other platforms as well. But Srinivas has the same model of laptop and could not reproduce the problem.

Thus I suspect this might be a unit failure on this machine only.


carbonchauvinist@protonmail.ch

Can you please try the latest upstream and see whether the symptom changes, and can you please remove kernel command line options like "quiet", "splash", etc., and then recapture the kernel panic screenshot?
I'm not sure whether this kernel panic is related to this problem or not.

Comment 94 carbonchauvinist 2021-05-23 02:07:03 UTC

Created attachment 296951 [details]
5.13.0-rc2-1-mainline - mce when booting with turbo enabled

(In reply to Zhang Rui from comment #93)
> This is really a long thread, and it is a long time since last update.
> 
> Just want to confirm if my understanding of this problem is correct or not,
> 1. problem can not be reproduced with Turbo disabled in BIOS

Yes that is correct.

> 2. problem can not be reproduced with Turbo disabled via kernel command
> line, using srinivas' patch.

Yes, that is also correct.

> 3. problem can not be reproduced with powersave governor + passive mode

I was unable to reproduce the issue with powersave governor + passive mode.

> And all these suggests that there is a racing/thermal issue during boot that
> causes the failure.
> 
> I'm not sure how likely this is a racing issue because if it is, it should
> also be reproducible on other platforms as well. But Srinivas has the same
> model of laptop and could not reproduce the problem.
> 
> Thus I suspect this might be a unit failure on this machine only.

That honestly makes sense, especially if Srinivas has the exact same model and is unable to reproduce.

> carbonchauvinist@protonmail.ch
> 
> can you please try the latest upstream and see if the symptom changes or
> not, and can you please remove the kernel commandlines like "quiet",
> "splash", etc, and then recapture the kernel panic screenshot?
> I'm not sure if this kernel panic is related with this problem or not.

I installed the latest upstream as of today (5.13.0-rc2-1-mainline), booted without the quiet kernel parameter and with turbo re-enabled in the BIOS, and sure enough the first three boots froze. On the fourth boot I was able to make it to gdm and log in to the GNOME DE.

Here's the picture captured showing the errors "mce: CPUs not responding to MCE broadcast (may include false positives): 0"

Interestingly, perhaps, that's only listed three times though I have a four-core proc. I do remember that if I booted with only three cores enabled I was able to boot with turbo enabled. So perhaps this is further evidence of unit failure for this machine only (affecting one core?)?

Let me know your thoughts, otherwise I'm willing to close this issue out (with tremendous thanks to all who assisted in troubleshooting for their valuable time and expertise).

Comment 95 Zhang Rui 2021-05-24 13:44:05 UTC

(In reply to carbonchauvinist from comment #94)
> > 
> > I'm not sure how likely this is a racing issue because if it is, it should
> > also be reproducible on other platforms as well. But Srinivas has the same
> > model of laptop and could not reproduce the problem.
> > 
> > Thus I suspect this might be a unit failure on this machine only.
> 
> That honestly makes sense, especially if Srinivas has the exact same model
> and is unable to reproduce.
> 
> > carbonchauvinist@protonmail.ch
> > 
> > can you please try the latest upstream and see if the symptom changes or
> > not, and can you please remove the kernel commandlines like "quiet",
> > "splash", etc, and then recapture the kernel panic screenshot?
> > I'm not sure if this kernel panic is related with this problem or not.
> 
> I installed the latest upstream as of today (5.13.0-rc2-1-mainline) booted
> without quiet kernel parameters and turbo re-enabled from the BIOS and sure
> enough first three boots froze. The fourth boot I was able to make it to gdm
> and login to gnome de.
> 
> Here's the picture captured showing the errors "mce: CPUs not responding to
> MCE broadcast (may include false positives): 0"
> 
> Interestingly, perhaps, that's only listed three times though I have a
> four-core proc. I do remember that if I booted with only three cores enabled
> I was able to boot with turbo enabled. So perhaps this is further evidence
> of unit failure for this machine only (affecting one core?)?
> 
> Let me know your thoughts, otherwise I'm willing to close this issue out
> (with tremendous thanks to all who assisted in troubleshooting for their
> valuable time and expertise).

Thanks for the update, carbonchauvinist.
I agree we can close this bug report. I don't have any further thoughts on this problem from the OS perspective.
