Created attachment 276677 [details] hwinfo Dell Precision 5510 with latest available BIOS (1.7.0) running latest Arch stable kernel. Skylake i5-6440HQ with integrated HD 530 only. Symptoms are random freezes: hard locks during which the Caps Lock key blinks and only a hard power reset recovers the machine. journalctl does not appear to capture any logs when this happens. I'm using the i915 Intel driver (not modesetting) and have it loading early in the initramfs (passed as a module in mkinitcpio.conf), and sometimes the freezes occur during boot. Disabling intel_pstate and falling back on acpi-cpufreq resolves the issue for me and I experience no more freezing. I've tried intel_pstate=passive and intel_pstate=no_hwp to no effect; only completely disabling it works. $ cat /proc/cmdline initrd=\intel-ucode.img initrd=\initramfs-linux.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet acpi_backlight=vendor intel_pstate=disable $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver acpi-cpufreq acpi-cpufreq acpi-cpufreq acpi-cpufreq I would prefer to be able to use the intel_pstate driver, though, if at all possible. I've attached hwinfo, lspci and cpuinfo, and can provide any additional information which may be needed.
Created attachment 276679 [details] hwinfo-correct corrected hwinfo
Created attachment 276681 [details] lspci -nnv
Created attachment 276683 [details] cpuinfo
It seems that Intel has not updated the maintainer (the assignee) for the intel_pstate driver, it hasn't been Kristen for a few years now. (i.e. I wonder if the current maintainers have seen this.) Have you tried kernel 4.17.3? or 4.18-rc2? and does the problem still occur with them? What is the average frequency of occurrence of the issue? i.e. hours, days, weeks?
I just tried 4.18-rc2 and still have issues, though they manifest differently. Now, without "intel_pstate=disable", I can barely get through a boot. It hard-locked three times in a row while booting 4.18-rc2, before getting to SDDM. On the fourth try I was at least able to get to SDDM, but it hung immediately. Here's the log from that instance. https://ptpb.pw/kTnB Interestingly, loading my current kernel, 4.17.2-1-ARCH, without "intel_pstate=disable" I also have issues with it hard-locking during boot; around 70% of the time it will hang during boot. When I am able to boot successfully, in the limited trials I've done so far, I seem to be able to run for long periods of time without any hangs (~2-3 hours tested so far). Previously I would have had hangs within minutes in some instances. The issue now with 4.17.2-1-ARCH, when I am able to get past the boot process without "intel_pstate=disable", is that I am unable to suspend properly. After running "systemctl suspend", my computer reboots when I try to resume.
Tried booting into multi-user.target instead of graphical.target with 4.18-rc2 without disabling intel_pstate and still got an almost immediate freeze when logging in, log pasted below: https://ptpb.pw/pSKH
You could try updating the microcode for your processor, as 0XC6 is the current version and you have 0XC2. Your logs should be attached here, instead of an external reference. There seems to be ACPI issues in your log. Was everything O.K. with an older kernel? (you might have to go back to 4.13 to test) And if yes, do you know how to bisect the kernel to isolate the problem commit? I'll see if the current intel_pstate maintainers can have a look at this.
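A quick way to see which microcode revision each logical CPU is actually running is to summarise the `microcode` lines of /proc/cpuinfo. This is only a minimal sketch: the function name `ucode_revisions` and its optional file argument are hypothetical, and it assumes the x86 cpuinfo layout (`microcode : 0x..`).

```shell
#!/bin/sh
# Print each distinct microcode revision found in a cpuinfo-style file,
# followed by the number of logical CPUs reporting it.
# $1 is optional and defaults to /proc/cpuinfo; the argument exists only
# so the function can be tried against a saved copy of the file.
ucode_revisions() {
    awk -F': *' '/^microcode/ { count[$2]++ }
                 END { for (rev in count) print rev, count[rev] }' \
        "${1:-/proc/cpuinfo}" | sort
}
```

Running `ucode_revisions` with no argument on the affected machine should print a single line if all cores agree on the revision; more than one line would mean the cores report different revisions.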
First, always run with the latest BIOS. You have the issue in HWP mode, where the driver doesn't do much other than activate HWP, so I think something else is triggering this. Depending on its configuration, acpi-cpufreq on such a system may not be able to do much CPU frequency scaling, so it may simply not be triggering some other issue. Since this system was previously loaded with Windows, we need to check whether it would have used HWP at all, and also what configuration acpi-cpufreq can use. So please attach the output of acpidump: #acpidump > acpi.out I think you can still run for a few moments before the lockup, so run #turbostat --debug with both acpi-cpufreq and intel_pstate (default).
Also try one more combination by adding the following to the kernel command line: dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc
Created attachment 277085 [details] 4.18-rc2 journal without "intel_pstate=disable" Here's the log from booting 4.18-rc2 without the "intel_pstate=disable" parameter, took me four times to actually get through boot process and have SDDM load, only to immediately hang after entering my password (still at SDDM screen no desktop loaded).
Created attachment 277087 [details] 4.18-rc2 multi-user.target journal without "intel_pstate=disable" Here's the journal from booting into systemd.unit=multi-user.target on 4.18-rc2 without the "intel_pstate=disable" parameter; booted on first try (though that may have just been lucky), but still immediately hung after entering my password.
@Doug Smythies Interesting, I have enabled all of the steps required for microcode update and it seems to think I should be on 0xc2; see below: $ dmesg | grep microcode [ 0.765652] microcode: sig=0x506e3, pf=0x20, revision=0xc2 [ 0.765799] microcode: Microcode Update Driver: v2.2. $ grep microcode /proc/cpuinfo microcode : 0xc2 microcode : 0xc2 microcode : 0xc2 microcode : 0xc2 $ bsdtar -Oxf /boot/intel-ucode.img | iucode_tool -tb -lS - iucode_tool: system has processor(s) with signature 0x000506e3 microcode bundle 1: (stdin) selected microcodes: 001/160: sig 0x000506e3, pf_mask 0x36, 2017-11-16, rev 0x00c2, size 99328 $ pacman -Qi intel-ucode Name : intel-ucode Version : 20180425-1 Description : Microcode update files for Intel CPUs Architecture : any URL : https://downloadcenter.intel.com/SearchResult.aspx?lang=eng&keyword=processor%20microcode%20data%20file Licenses : custom Groups : None Provides : None Depends On : None Optional Deps : None Required By : None Optional For : None Conflicts With : None Replaces : microcode_ctl Installed Size : 1639.00 KiB Packager : Christian Hesse <arch@eworm.de> Build Date : Mon 07 May 2018 05:11:33 PM EDT Install Date : Wed 09 May 2018 01:26:53 PM EDT Install Reason : Explicitly installed Install Script : No Validated By : Signature I am running the latest BIOS from Dell, I do realize the ACPI seems to be buggy. Not sure what I can do about that. $ sudo dmidecode|head -n 37 # dmidecode 3.1 Getting SMBIOS data from sysfs. SMBIOS 2.8 present. 91 structures occupying 5667 bytes. Table at 0x000EAC00. Handle 0x0000, DMI type 0, 24 bytes BIOS Information Vendor: Dell Inc. 
Version: 1.7.0 Release Date: 02/23/2018 Address: 0xF0000 Runtime Size: 64 kB ROM Size: 16 MB Characteristics: PCI is supported PNP is supported BIOS is upgradeable BIOS shadowing is allowed Boot from CD is supported Selectable boot is supported EDD is supported 5.25"/1.2 MB floppy services are supported (int 13h) 3.5"/720 kB floppy services are supported (int 13h) 3.5"/2.88 MB floppy services are supported (int 13h) Print screen service is supported (int 5h) 8042 keyboard services are supported (int 9h) Serial services are supported (int 14h) Printer services are supported (int 17h) ACPI is supported USB legacy is supported Smart battery is supported BIOS boot specification is supported Function key-initiated network boot is supported Targeted content distribution is supported UEFI is supported BIOS Revision: 1.7 There was a point in time where I didn't have these issues, might have been 4.13, but can't be certain - I always put it off to look into later hoping newer kernels would address. I know nothing about bisects but would be willing to look into what that entails.
@Srinivas Pandruvada I'm uploading the files you requested now. Please note this machine was originally purchased with Ubuntu and did not come with Windows installed. I recently added an M.2 NVMe drive with a Windows install, but I don't think it is related to my issues. I will try to boot with the additional parameters you suggested (dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc) and report back.
Created attachment 277089 [details] acpi.out @Srinivas Pandruvada So attach: output of acpidump #acpidump > acpi.out
Created attachment 277091 [details] turbostat acpi-cpufreq @Srinivas Pandruvada run #turbostat --debug with acpi-cpufreq
Created attachment 277093 [details] turbostat intel_pstate @Srinivas Pandruvada I think you can still run for few moments before lockup: run #turbostat --debug with intel_pstate (default) ** Please note, I now seem to be able to run with intel_pstate enabled without getting hard-locks (at least in limited testing so far, lasting up to 3-4 hours, whereas before it would lock up within minutes at times). However, the issue now is that if I boot without disabling intel_pstate I will NOT make it through the boot process ~70% of the time. It will hang/hard-lock at random points throughout the boot before loading SDDM. Other times, even if SDDM loads, it will hang prior to loading the desktop. If I make it to the desktop, though, it appears to be okay. The other issue when not disabling intel_pstate, the few times I actually make it to a desktop, is that the computer reboots when attempting to resume from suspend.
Pretty idle systems in both cases. Can you run some workload and collect turbostat output? I want to see whether you are able to scale frequency with acpi-cpufreq. I think you may be getting some thermal event during boot and resume, which acpi-cpufreq can handle. So try the options I suggested; they may help and may also dump something in the logs.
Created attachment 277095 [details] 4.17.2-1-ARCH journal with "dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc" @Srinivas Pandruvada Here's a log from a boot with the parameters you suggested. I got a hang during boot the first time, and was able to make it past boot to the desktop the second time. However, after putting the system in suspend, it hung when trying to resume. I will get you better turbostat logs with acpi-cpufreq shortly.
Created attachment 277097 [details] turbostat --debug acpi-freq with increased system load @Srinivas Pandruvada Here's a turbostat which hopefully is more useful?
I see that acpi-cpufreq is basically running without turbo even with 98% load, so this may be preventing some thermal issues. I also see some thermal events. When you are booted with intel_pstate, disable turbo: #echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo then do a suspend/resume and check.
Also, along with the above suggestion, do #rdmsr 0x1a0 and, whatever you read, just set bit 38 of it. For example, if you read 0x850089 then write 0x4000850089: #wrmsr 0x1a0 0x4000850089
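For reference, MSR 0x1a0 is IA32_MISC_ENABLE, and bit 38 is documented in Intel's SDM as the turbo mode disable bit, so this is roughly the MSR-level counterpart of no_turbo. The arithmetic behind the example value can be sketched as below; `set_bit38` is a hypothetical helper name, and it only computes the value to pass to wrmsr, it does not touch any MSR itself.

```shell
#!/bin/sh
# Compute the value to write back to MSR 0x1a0 with bit 38 set.
# $1 is the value printed by rdmsr, with or without a leading 0x.
set_bit38() {
    val=${1#0x}                                # strip an optional 0x prefix
    printf '0x%x\n' $(( 0x$val | (1 << 38) ))  # OR in bit 38 (0x4000000000)
}
```

For the value in the example, `set_bit38 850089` prints `0x4000850089`. As far as I know the write does not survive a reboot, so it would have to be repeated on each boot while testing.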
Created attachment 277103 [details] journal 4.17.3-1-ARCH with "dyndbg="file intel_pstate.c +p" intel_pstate=no_hwp intel_pstate=support_acpi_ppc" and disabled turbo @Srinivas Pandruvada Here's another journal from a recent boot. I booted with the additional kernel parameters you've instructed (now on 4.17.3-1-ARCH) and disabling the turbo as per your instructions. $ cat /etc/systemd/system/disable_turbo.service [Unit] Description=disable turbo when on intel_pstate [Service] Type=oneshot RemainAfterExit=yes ExecStart = /home/ghost/scripts/disable_pstate_turbo.sh [Install] WantedBy=multi-user.target $ cat scripts/disable_pstate_turbo.sh #!/bin/sh if [ -f /sys/devices/system/cpu/intel_pstate/no_turbo ]; then echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo && printf "intel_pstate turbo disabled" ; else printf "acpi-freq not intel_pstate - no turbo to disable"; fi Again, with pstate enabled (even if hwp is disabled) once I make it past the boot process (which is no small feat) the system is reliable. I was able to suspend multiple times during this session and bring the computer back up. However, at the end of the night I suspended and while suspended plugged in the ac adapter to leave overnight to charge. This morning when I went to resume, I unplugged the ac adapter and tried to resume. The computer performed a reboot again instead of resume. It literally took me booting ten (10) times this morning before I was able to make it through a boot to my desktop. I'll be happy to make those changes to the MSR you recommended. Can you please advise as to what that's actually changing though? I assume that's not a permanent change and would need to be done on each boot when testing? I tried looking up what that specific MSR is in the Intel® 64 and IA-32 Architectures Software Developer’s Manual without luck (obviously over my head).
Created attachment 277105 [details] turbostat --debug intel_pstate w/ edited msr, disabled turbo, increased load I made the edits to the msr you instructed. They did read as only "850089" at first, and after writing the new value this is what I currently read: $ sudo rdmsr -a 0x1a0 4000850089 4000850089 4000850089 4000850089 Ran another turbostat with increased load in case that's helpful.
Changing the MSR is not the solution. It basically proves the theory that the platform has a thermal issue during boot or resume. Disabling turbo helps here, hence you are able to resume. Once the system is up you can use thermald to control thermals, but during boot that won't help. So we need to debug why acpi-cpufreq is not using turbo at all: either because of a bug or because it got some notification. Can you boot with acpi-cpufreq (intel_pstate disabled) using the following kernel parameters and collect dmesg, then run some workload and collect dmesg again? intel_pstate=disable dyndbg="file processor_perflib.c +p"
I see; re: finding out why acpi-cpufreq is not using turbo at all may have had to do with TLP being installed? I've disabled TLP for the time being just in case. Please note the intel_pstate issues were present irrespective of having TLP installed and enabled. I'll attach the dmesg output you requested here shortly.
Created attachment 277109 [details] dmesg acpi-cpufreq at boot intel_pstate=disable dyndbg="file processor_perflib.c +p" Boot with acpi-cpufreq with intel_pstate disabled with the following kernel parameters. Collect dmesg. intel_pstate=disable dyndbg="file processor_perflib.c +p"
Created attachment 277111 [details] dmesg acpi-cpufreq increased load intel_pstate=disable dyndbg="file processor_perflib.c +p" Boot with acpi-cpufreq with intel_pstate disabled with the following kernel parameters ... Then run some workload and collect dmesg. intel_pstate=disable dyndbg="file processor_perflib.c +p"
I don't see any thermal notification to limit frequency, and the previous log showed that acpi-cpufreq didn't go into turbo at all. That may be a bug. If we fix it, you will probably have the same issue with acpi-cpufreq. You already checked that disabling turbo helped with intel_pstate during resume, so this is somehow caused by system thermals. To double-check, run both governors with a single-threaded workload like stress -c 1 and run turbostat. I will see if I can get the same platform to check.
Created attachment 277117 [details] acpi-cpufreq schedutil governor turbostat single thread stress For double confirm, run both governors and check a single thread workload like stress -c 1 and run turbostat. schedutil governor (default) $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors performance schedutil performance schedutil performance schedutil performance schedutil $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor schedutil schedutil schedutil schedutil
Created attachment 277119 [details] acpi-cpufreq performance governor turbostat single thread stress For double confirm, run both governors and check a single thread workload like stress -c 1 and run turbostat. performance governor $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors performance schedutil performance schedutil performance schedutil performance schedutil $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor performance performance performance performance
I do see turbo with acpi-cpufreq. I don't know why it was absent in the previous log with 96%+ busy. Let me check if I can find this platform to try.
I think TLP was the culprit for the discrepancy (see possibly relevant sections from /etc/default/tlp below that could explain); turbo or no-turbo may have depended on if I was running on AC or battery? # Set energy performance hints (HWP) for Intel P-state governor: # performance, balance_performance, default, balance_power, power # Values are given in order of increasing power saving. # Note: Intel Skylake or newer CPU and Kernel >= 4.10 required. CPU_HWP_ON_AC=balance_performance CPU_HWP_ON_BAT=balance_power # Minimize number of used CPU cores/hyper-threads under light load conditions: # 0=disable, 1=enable. SCHED_POWERSAVE_ON_AC=0 SCHED_POWERSAVE_ON_BAT=1 # Set CPU performance versus energy savings policy: # performance, balance-performance, default, balance-power, power. # Values are given in order of increasing power saving. # Requires kernel module msr and x86_energy_perf_policy from linux-tools. ENERGY_PERF_POLICY_ON_AC=performance ENERGY_PERF_POLICY_ON_BAT=power
If you change the default acpi-cpufreq governor to performance at boot, do you still have no issue? To make it the default you need to set CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y in the kernel config.
And of course rebuild the kernel after changing the kernel config.
(In reply to carbonchauvinist from comment #12) > @Doug Smythies > > Interesting, I have enabled all of the steps required for microcode update > and it seems to think I should be on 0xc2; see below: > See here: https://www.intel.com/content/dam/www/public/us/en/documents/sa00115-microcode-update-guidance.pdf at the bottom of page 15. But let's put that on hold for now. (In reply to carbonchauvinist from comment #25) > I see; re: finding out why acpi-cpufreq is not using turbo at all may have > had to do with TLP being installed? I've disabled TLP for the time being > just in case. Please note the intel_pstate issues were present irrespective > of having TLP installed and enabled. Then I suggest leaving TLP disabled and/or removed, and out of the picture. (In reply to carbonchauvinist from comment #32) > I think TLP was the culprit for the discrepancy (see possibly relevant > sections from /etc/default/tlp below that could explain); turbo or no-turbo > may have depended on if I was running on AC or battery? Again, I suggest taking TLP out of the picture. I don't know about this particular Dell computer, but at least some force different operating conditions depending on whether the power adapter is plugged in, and even on whether the adapter is a Dell adapter. These tests must be done under exactly the same conditions from one test to the next. (In reply to carbonchauvinist from comment #12) > @Doug Smythies > There was a point in time where I didn't have these issues, might have been > 4.13, but can't be certain - I always put it off to look into later hoping > newer kernels would address. Myself (but I'd defer to Srinivas), I think that would be important to identify for certain.
(In reply to Doug Smythies from comment #35) > See here: > > https://www.intel.com/content/dam/www/public/us/en/documents/sa00115- > microcode-update-guidance.pdf > > at the bottom of page 15. > But let's put that on hold for now. Thanks for this, and for getting this to the right people to begin the process. I've been assured by folks on the Arch IRC channel that I have the latest publicly available microcode for my CPU at this time. > Then suggest to leave TLP disbaled and/or removed and out of it. > Again, suggest to take TLP out of it. I don't know about this particular > Dell computer, but at least some force different operating conditions > depending on the power adapter being plugged in or not, and even if the > adapter is not a Dell adapter. These tests must be done under the exact same > conditions between tests. Of course, I've already disabled and removed TLP entirely from my system for the time being. I'm embarrassed that I forgot to mention I'd installed it on my machine once I finally found a solution for my freezing issues (disabling intel_pstate), and that I didn't put the two together about the turbo discrepancies until much later than I should have. > Myself (but I'd defer to Srinivas), I think that would be important to > identify for certain. I can look to try this out this upcoming weekend; I guess I can just download the old linux versions from the Arch archive. Maybe just download all >= 4.10 and then continue to test them until I run into the problem?
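Rather than testing every packaged kernel in order, the list of versions can be searched by halving, which is the same idea git bisect applies to commits. A minimal sketch follows, assuming the versions are passed oldest first, the newest one is known to be bad, and the breakage stays once introduced. `first_bad` and `is_bad` are hypothetical names; `is_bad` stands for the manual step "install this kernel, boot it several times, and report whether it froze".

```shell
#!/bin/sh
# first_bad VERSION...: binary-search an ordered list of kernel versions
# (oldest first) for the first one that is_bad reports as failing.
# is_bad is a user-supplied function: it returns 0 if the given version
# shows the freeze and non-zero otherwise.
first_bad() {
    lo=1
    hi=$#
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"      # fetch the mid-th positional argument
        if is_bad "$candidate"; then
            hi=$mid                    # first bad version is at mid or earlier
        else
            lo=$(( mid + 1 ))          # first bad version is after mid
        fi
    done
    eval "echo \${$lo}"
}
```

Because the freeze here is intermittent (roughly 70% of boots), a version should only be called good after a number of clean boots; a single successful boot proves little.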
(In reply to Srinivas Pandruvada from comment #33) > If you change default acpi-cpufreq governor to performance at boot, do you > still don't have issue? For this to be default you need to change the kernel > config > CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y. I'm not sure the issue you're referring to here specifically. I think not hitting the turbo on acpi-freq I was experiencing earlier when troubleshooting was due to TLP and/or being on battery and not AC. I've now removed TLP completely for sake of testing and I'm running all tests connected to AC moving forward (unless asked otherwise) and apologize for the noise. I've built a version of 4.17.3-1-ARCH with performance default governor. I tried one boot without disabling intel_pstate and it hung in the boot process, so that issue is still present. I then booted this new kernel with the intel_pstate=disable dyndbg="file processor_perflib.c +p" parameters. I'll attach the dmesg and turbostat shortly, but it appeared to go into Turbo mode when increasing load. Is there a better way to capture the turbostat output? I've been using "sudo turbostat --debug > turbostat.out 2>&1" but it seems to truncate (or not capture all) the output?
Created attachment 277139 [details] dmesg acpi-cpufreq performance default governor dmesg at boot with acpi-freq and performance as rebuilt kernel default governor
Created attachment 277141 [details] turbostat --debug acpi-freq performance default increased load turbostat with increased load with performance as default governor only additional line to dmesg from boot after increased load was: $ diff dmesg.out dmesg_after.out 1013a1014 > [ 181.557943] capability: warning: `turbostat' uses 32-bit capabilities > (legacy support in use)
So even with the acpi-cpufreq performance governor as default there were no PPC notifications, which we would expect to see if there were a thermal issue. Even if the intel_pstate default powersave governor behaved aggressively, it still can't be more aggressive than the acpi-cpufreq performance governor. So this is a very strange issue. The only additional thing acpi-cpufreq would have done is to inform SMM, but acpidump shows acpi_gbl_FADT.pstate_control as 0 (" P-State Control : 00"), so this call will simply return. One thing Len suggested is to build with CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE=y and the other CONFIG_CPU_FREQ_DEFAULT_GOV_* options as "not set", boot with the kernel command line "intel_pstate=passive", and see if it can boot.
Okay. So rebuilt as instructed: # CPU Frequency scaling # CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_GOV_ATTR_SET=y CONFIG_CPU_FREQ_GOV_COMMON=y CONFIG_CPU_FREQ_STAT=y # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE=y # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=m CONFIG_CPU_FREQ_GOV_USERSPACE=m CONFIG_CPU_FREQ_GOV_ONDEMAND=m CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y I was just able to boot with intel_pstate=passive literally 10 times in a row (connected to AC) which is very encouraging. $ journalctl --list-boots ... -10 8b4e176c56c74d6ab0b94a48e2b7a859 Tue 2018-07-03 19:35:33 EDT—Tue 2018-07-03 19:36:00 EDT -9 e3e6ffc7fa3744269454e83e74a90ee7 Tue 2018-07-03 19:36:41 EDT—Tue 2018-07-03 19:37:26 EDT -8 7b54ba3436334ce483f0ac804bbccf05 Tue 2018-07-03 19:38:08 EDT—Tue 2018-07-03 19:38:36 EDT -7 ea41045d5cc444349d36e0141489ce1e Tue 2018-07-03 19:39:19 EDT—Tue 2018-07-03 19:39:56 EDT -6 7fc80a4270324dd8ada6dd8ea66dfcea Tue 2018-07-03 19:40:36 EDT—Tue 2018-07-03 19:41:11 EDT -5 400827dda7b041318b28040977670d7f Tue 2018-07-03 19:41:48 EDT—Tue 2018-07-03 19:42:15 EDT -4 066d23f2d5f04712af19fe67159c591e Tue 2018-07-03 19:42:57 EDT—Tue 2018-07-03 19:43:26 EDT -3 688d37a799b0404ea78b19cc0fca64ec Tue 2018-07-03 19:44:06 EDT—Tue 2018-07-03 19:45:47 EDT -2 bcda0e20d07541acb2d955b1ef4619d2 Tue 2018-07-03 19:46:23 EDT—Tue 2018-07-03 19:46:42 EDT -1 31e39c62d9ad48eda5f8f30094797221 Tue 2018-07-03 19:50:15 EDT—Tue 2018-07-03 19:50:54 EDT 0 eab59bb0608148bcacf37c5355512571 Tue 2018-07-03 19:51:29 EDT—Tue 2018-07-03 19:52:09 EDT $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor powersave powersave powersave powersave $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver intel_cpufreq 
intel_cpufreq intel_cpufreq intel_cpufreq $ cat /proc/cmdline initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=passive dyndbg="file processor_perflib.c +p" I'll post a dmesg and journal from the last boot as well in case it's useful.
Created attachment 277161 [details] dmesg intel-cpufreq powersave default governor boot intel_pstate=passive dyndbg="file processor_perflib.c +p" this is the dmesg from booting 4.17.3-1 with intel-cpufreq (intel_pstate=passive dyndbg="file processor_perflib.c +p") and default governor set to powersave
Created attachment 277163 [details] journal intel-cpufreq powersave default governor boot intel_pstate=passive dyndbg="file processor_perflib.c +p" this is the journal from booting 4.17.3-1 with intel-cpufreq (intel_pstate=passive dyndbg="file processor_perflib.c +p") and default governor set to powersave
This confirms that the platform has thermal issues during boot (we usually see higher CPU temperature during boot because of the extra processing). So it only tolerates a less aggressive governor during boot (but how acpi-cpufreq with the performance governor as default has no issue is a mystery). Since you have changed the base thermals by adding or replacing the NVMe drive, I don't know whether that changed the thermal model the system was built for. I am waiting for Mario from Dell to confirm whether he sees the same issue on this model. I can build a patch that disables turbo during boot and allows it to be enabled later from sysfs.
(In reply to Srinivas Pandruvada from comment #45) > This confirms that platform has thermal issues during boot (usually during > boot because of extra processing, we will see higher CPU temperature). I think a race condition is also a possibility, where forcing the slower clock rate during boot kept the condition from occurring. Note also that one of the OP's turbostat logs seemed to indicate that full load on all CPUs was O.K., with the processor package temperature only getting to 85 degrees.
Created attachment 277207 [details] boot kernel command line turbo disable @Doug Possible also. Going higher at boot time than at runtime is not unusual because of uncore initialization, which gets poked by i915 and other PCI devices. Attached a patch; boot with the following added to the kernel command line, and later change it from the intel_pstate sysfs attribute (/sys/devices/system/cpu/intel_pstate/no_turbo): intel_pstate=no_turbo
(In reply to Srinivas Pandruvada from comment #45) > Since you have changed base thermal by adding addition or replace NVMe, > don't know if it changed the thermal model, the system was built for. Just a small note, I only added the m.2 NVMe to this system in May and importantly I was also having the issues prior to the addition. $ lsblk NAME FSTYPE TYPE MOUNTPOINT sda disk ├─sda1 vfat part /boot └─sda2 crypto_LUKS part └─lvm LVM2_member crypt ├─lvmvg-root ext4 lvm / ├─lvmvg-var ext4 lvm /var └─lvmvg-home ext4 lvm /home nvme0n1 disk ├─nvme0n1p1 ntfs part ├─nvme0n1p2 vfat part ├─nvme0n1p3 part └─nvme0n1p4 ntfs part I run linux from the samsung evo 850 sata and don't access the m.2 at all when booted if that matters. Also since changing to default powersave governor I've even been able to run with intel_pstate fully enabled now for close to two days of uptime. I've been suspending and resuming without issue (all connected to AC). Of course performance is noticeably slower. $ uptime 19:46:03 up 1 day, 22:03, 1 user, load average: 0.47, 0.53, 0.51 $ cat /proc/cmdline initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet dyndbg="file processor_perflib.c +p" $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver intel_pstate $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor powersave > I can build a patch to allow disable turbo during boot and allow later from > sysfs, to be enabled. Will try your patch and report back.
I'm not sure if I did something wrong or not. I've added your patch to vanilla kernel, but get an error in one out of the three hunks when trying to build: $ makepkg -f ... patching file drivers/cpufreq/intel_pstate.c Hunk #1 FAILED at 297. Hunk #2 succeeded at 2042 (offset -221 lines). Hunk #3 succeeded at 2340 (offset -229 lines). 1 out of 3 hunks FAILED -- saving rejects to file drivers/cpufreq/intel_pstate.c.rej ==> ERROR: A failure occurred in prepare(). Aborting... $ cat $(locate intel_pstate.c.rej) --- drivers/cpufreq/intel_pstate.c +++ drivers/cpufreq/intel_pstate.c @@ -297,6 +297,7 @@ static int hwp_active __read_mostly; static int hwp_mode_bdw __read_mostly; static bool per_cpu_limits __read_mostly; static bool hwp_boost __read_mostly; +static bool global_no_turbo __read_mostly; static struct cpufreq_driver *intel_pstate_driver __read_mostly;
What baseline are you using? I can make a patch for that. Or you can just add the line manually to your partially merged intel_pstate.c: static bool global_no_turbo __read_mostly;
I'm using the 4.17.3-1 baseline, my apologies if I misunderstood. Which did you patch? I can move to that instead.
Created attachment 277233 [details] disable_boot_turbo_option ported to 4.17
Created attachment 277235 [details] dmesg turbo_disable_patch 4.18.0-rc3 intel_pstate=no_turbo Okay, got your patch applied to mainline (4.18.0-rc3) and just booted successfully (see attached dmesg from boot). $ uname -a Linux 5510 4.18.0-rc3-custom #1 SMP PREEMPT Fri Jul 6 11:56:18 EDT 2018 x86_64 GNU/Linux $ cat /sys/devices/system/cpu/intel_pstate/no_turbo 1 $ cat /proc/cmdline initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo dyndbg="file processor_perflib.c +p" $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver intel_pstate $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor powersave $ dmesg | grep -i turbo [ 0.000000] Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo dyndbg="file processor_perflib.c +p" [ 0.000000] intel_pstate: Turbo disabled by boot options [ 0.000000] Kernel command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo dyndbg="file processor_perflib.c +p" Assuming all that looks correct and as expected I will continue to test make sure it can boot without issue and I'll run some increased loads to log with turbostat after making sure to enable turbo manually (# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo) after booting.
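The check-then-toggle sequence used after boot can be wrapped in a small helper. This is only a sketch: `turbo_status` and `set_turbo` are hypothetical names, and the optional path argument exists just so the functions can be tried against an ordinary file. On a real system, writing the sysfs attribute needs root, and the attribute is only present when intel_pstate is the active driver.

```shell
#!/bin/sh
NO_TURBO=/sys/devices/system/cpu/intel_pstate/no_turbo

# turbo_status [path]: report whether turbo is currently allowed,
# based on the contents of the no_turbo attribute.
turbo_status() {
    f=${1:-$NO_TURBO}
    if [ ! -r "$f" ]; then
        echo "unknown"
        return 1
    fi
    case $(cat "$f") in
        1) echo "disabled" ;;
        0) echo "enabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# set_turbo on|off [path]: write 0 (turbo on) or 1 (turbo off) to no_turbo.
set_turbo() {
    f=${2:-$NO_TURBO}
    case $1 in
        on)  echo 0 > "$f" ;;
        off) echo 1 > "$f" ;;
        *)   echo "usage: set_turbo on|off [path]" >&2; return 2 ;;
    esac
}
```

After booting with intel_pstate=no_turbo, `turbo_status` should report `disabled`; running `set_turbo on` as root corresponds to the manual `echo 0` quoted above.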
So somehow the higher CPU frequencies are either causing a thermal problem or triggering a race condition in some code path.
Created attachment 277255 [details] journalctl -k 4.18.0-rc3 intel_pstate=no_turbo freeze journalctl -k 4.18.0-rc3 intel_pstate=no_turbo freeze With the patch applied to 4.18.0-rc3 and booting with intel_pstate=no_turbo I can get to the desktop without issue. I can run an increased load test (i.e. compile gcc5) just fine with turbo disabled. However, when trying to run increased load test (compile gcc5) after manually re-enabling turbo (# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo) I've gotten a hard freeze every time. In fact, just re-enabling turbo seems to be enough to cause a freeze - even without any increased load. I've attached the journal (kernel messages) from one of the boots where this happened for reference (as soon as I manually re-enabled turbo, I then pulled up my Plasma Application Dashboard and it froze before I could make a selection). Should I try the patch on 4.17.3-1 too?
Created attachment 277257 [details] journalctl -k 4.17.3-1 4.17.3-1 kernel also freezes with applied patch and when manually enabling turbo after booting with intel_pstate=no_turbo parameter (see journalctl kernel messages attached). If I don't enable turbo after boot, so far, I haven't had any freezes even under increased load.
Created attachment 277259 [details] journalctl -k 4.17.3-1 patch intel_pstate=no_turbo - freeze when turbo enabled *corrected attachment* 4.17.3-1 kernel also freezes with applied patch and when manually enabling turbo after booting with intel_pstate=no_turbo parameter (see journalctl kernel messages attached). If I don't enable turbo after boot, so far, I haven't had any freezes even under increased load.
(In reply to carbonchauvinist from comment #55) > However, when trying to run increased load test (compile gcc5) after > manually re-enabling turbo (# echo 0 > > /sys/devices/system/cpu/intel_pstate/no_turbo) I've gotten a hard freeze > every time. > > In fact, just re-enabling turbo seems to be enough to cause a freeze - even > without any increased load. Interesting. How quickly does it freeze without any increased load? Myself, I still think there could be value added in trying older kernels. From comment 37: > Is there a better way to capture the turbostat output? > I've been using "sudo turbostat --debug > turbostat.out 2>&1" > but it seems to truncate (or not capture all) the output? The output looks O.K. to me. Note that turbostat doesn't list some package specific stuff for CPUs > 0. Myself, and for what we are looking into, I'd be tempted to run something like this: sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgTmp,PkgWatt,GFXWatt,CorWatt --interval 1
Try one more thing: After booting with the no_turbo option, run turbostat with the options Doug suggested, but without the --quiet option: $ sudo turbostat --show Busy%,Bzy_MHz,IRQ,PkgTmp,PkgWatt,GFXWatt,CorWatt --interval 1 Once you enable turbo you are seeing a freeze, so you probably can't run turbostat. If you can, run it with some load and attach the output. Then reboot, adding just one kernel parameter: cpufreq.off=1 Read and write some MSRs. You will need msr-util package. Attach the output of all of these. #rdmsr -a 0x770 #wrmsr -a 0x770 0x01 #rdmsr -a 0x771 #rdmsr -a 0x774 #wrmsr -a 0x774 0x80001a08 Collect the output of all commands issued up to this point, as the next command may cause a freeze. #wrmsr -a 0x774 0x80002308 If it doesn't freeze, run some workload and run turbostat again with the above options. If it still freezes, go back to an older kernel and see whether that one freezes too. If the older kernel doesn't freeze, we have to bisect.
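A note for anyone reproducing the MSR steps above: 0x774 is IA32_HWP_REQUEST, and the two values written to it can be split into fields. The following is only a sketch in Python, assuming the field layout given in the Intel SDM (bits 7:0 minimum, 15:8 maximum, 23:16 desired, 31:24 energy-performance preference). The function name is made up for illustration and is not part of any tool mentioned in this thread.

```python
def decode_hwp_request(value):
    """Split the low 32 bits of IA32_HWP_REQUEST (MSR 0x774) into fields.

    Assumed layout (Intel SDM): min 7:0, max 15:8, desired 23:16, EPP 31:24.
    """
    return {
        "min_perf": value & 0xFF,
        "max_perf": (value >> 8) & 0xFF,
        "desired_perf": (value >> 16) & 0xFF,
        "epp": (value >> 24) & 0xFF,
    }


# The two values written with wrmsr in the steps above.
for raw in (0x80001A08, 0x80002308):
    fields = decode_hwp_request(raw)
    print(f"{raw:#010x}: {fields}")
```

Under that assumption the two writes differ only in the maximum field (0x1a = 26 versus 0x23 = 35), which would explain why only the second write is expected to let the CPU run faster.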
(In reply to Doug Smythies from comment #58) > Interesting. How quickly does it freeze without any increased load? It's not an every time thing unfortunately. The one time I can specifically recall (sorry, there have been a lot of boots and tests recently): I'd only pulled up my full screen Plasma Application Dashboard and it froze. Also to further clarify, once enabling turbo after booting, I am sometimes able to run an increased load without freezes. There doesn't appear to be any rhyme or reason as to why sometimes it freezes and others it doesn't. The ratio seems to be approx. 7:3 (freezes:no-freezes). In fact one time while testing, after booting, I echoed "0" to enable turbo, then caused increase load (compile gcc5) and there was no freeze initially (normally happens during the "Extracting gcc-5.5.0.tar.xz with bsdtar" step). I cancelled out of the compile, re-echoed "0" to /sys/devices/system/cpu/intel_pstate/no_turbo and when I went to compile gcc5 again, it did freeze - and again at the "extracting... with bsdtar" step. Clear as mud, I'm sure.
Created attachment 277279 [details] turbostat 4.17.3-1 intel_pstate=no_turbo (increased load included) (In reply to Srinivas Pandruvada from comment #59) > Try one more thing: > After booting with no_turbo option > > run a turbostat with options as Doug suggested. But remove --quiet option > > $ sudo turbostat --show Busy%,Bzy_MHz,IRQ,PkgTmp,PkgWatt,GFXWatt,CorWatt > --interval 1 ran turbostat with requested options and increased load during test
Created attachment 277281 [details] turbostat 4.17.3-1 intel_pstate=no_turbo - enabled turbo (increased load included) (In reply to Srinivas Pandruvada from comment #59) > Try one more thing: > After booting with no_turbo option >... > Once you enable turbo, you are seeing freeze so probably you can't run > turbostat. If you can run with some load. > > > Attach the output. booted with no_turbo option, enabled turbo, increased load, this is one of the instances where it did not freeze.
Created attachment 277283 [details] turbostat 4.17.3-1 intel_pstate=no_turbo cpufreq.off=1 custom MSR values with increased load (In reply to Srinivas Pandruvada from comment #59) > Then reboot by just adding just one kernel parameter > cpufreq.off=1 Assuming I'm understanding you correctly, I'm booted with the following cmdline: $ journalctl -b -1 | grep Command Jul 08 10:33:41 archlinux kernel: Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo cpufreq.off=1 dyndbg="file processor_perflib.c +p" > Read and write some MSRs. You will need msr-util package. Attach the output > of all of these. > > #rdmsr -a 0x770 > #wrmsr -a 0x770 0x01 > #rdmsr -a 0x771 > #rdmsr -a 0x774 > #wrmsr -a 0x774 0x80001a08 > > Collect at this point all the outputs of command issued as next command may > cause freeze. > #wrmsr -a 0x774 0x80002308 Again, assuming I did this correctly, here's the output of the above commands (I tried to re-read the addresses after writing to confirm they'd changed, but I may be misunderstanding what's happening here): [root@5510 ghost]# rdmsr -a 0x770 [root@5510 ghost]# wrmsr -a 0x770 0x01 [root@5510 ghost]# [root@5510 ghost]# rdmsr -a 0x770 [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# rdmsr -a 0x771 [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# rdmsr -a 0x774 [root@5510 ghost]# wrmsr -a 0x774 0x80001a08 [root@5510 ghost]# [root@5510 ghost]# rdmsr -a 0x774 [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# wrmsr -a 0x774 0x80002308 [root@5510 ghost]# [root@5510 ghost]# rdmsr -a 0x774 [root@5510 ghost]# > If doesn't freeze then run some workload and again run turbostat with above > options with some load. I've attached the turbostat output of an increased workload for your review
(In reply to carbonchauvinist from comment #63) > > Read and write some MSRs. You will need msr-util package. Attach the output > > of all of these. > > > > #rdmsr -a 0x770 > > #wrmsr -a 0x770 0x01 > > #rdmsr -a 0x771 > > #rdmsr -a 0x774 > > #wrmsr -a 0x774 0x80001a08 > > > > Collect at this point all the outputs of command issued as next command may > > cause freeze. > > #wrmsr -a 0x774 0x80002308 > > Again, assuming I did this correctly, here's the output of the above > commands (I tried to re-read the addresses after writing to confirm they'd > changed, but I may be misunderstanding what's happening here): > > [root@5510 ghost]# rdmsr -a 0x770 > [root@5510 ghost]# wrmsr -a 0x770 0x01 > [root@5510 ghost]# > [root@5510 ghost]# rdmsr -a 0x770 > [root@5510 ghost]# > [root@5510 ghost]# > [root@5510 ghost]# rdmsr -a 0x771 > [root@5510 ghost]# > [root@5510 ghost]# > [root@5510 ghost]# rdmsr -a 0x774 > [root@5510 ghost]# wrmsr -a 0x774 0x80001a08 > [root@5510 ghost]# > [root@5510 ghost]# rdmsr -a 0x774 > [root@5510 ghost]# > [root@5510 ghost]# > [root@5510 ghost]# > [root@5510 ghost]# wrmsr -a 0x774 0x80002308 > [root@5510 ghost]# > [root@5510 ghost]# rdmsr -a 0x774 > [root@5510 ghost]# > Ya, something is wrong. It looks to me as though the msr module is not loaded (and note that the commands do not throw an error when this occurs). turbostat will load it, or you can load it manually: sudo modprobe msr Then you should get some results from the rdmsr commands. Example (I do not have HWP, so I use pstates as my example (idle system)): $ sudo rdmsr --bitfield 15:8 -d -a 0x198 16 16 16 16 16 16 16 16
Created attachment 277291 [details] *CORRECTED* turbostat 4.17.3-1 intel_pstate=no_turbo cpufreq.off=1 custom MSR values with increased load *CORRECTED* (In reply to Srinivas Pandruvada from comment #59) > Then reboot by just adding just one kernel parameter > cpufreq.off=1 Booted with the following cmdline: $ journalctl -b -1 | grep Command Jul 08 10:33:41 archlinux kernel: Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img rw rd.luks.uuid=145628bb-0138-4b8b-bc94-2d041c756539 rd.luks.name=145628bb-0138-4b8b-bc94-2d041c756539=lvm root=/dev/lvmvg/root quiet intel_pstate=no_turbo cpufreq.off=1 dyndbg="file processor_perflib.c +p" > Read and write some MSRs. You will need msr-util package. Attach the output > of all of these. > > #rdmsr -a 0x770 > #wrmsr -a 0x770 0x01 > #rdmsr -a 0x771 > #rdmsr -a 0x774 > #wrmsr -a 0x774 0x80001a08 > > Collect at this point all the outputs of command issued as next command may > cause freeze. > #wrmsr -a 0x774 0x80002308 Thanks to Doug's help, here's the CORRECTED output of the above commands: $ sudo su [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# rdmsr -a 0x770 0 0 0 0 [root@5510 ghost]# wrmsr -a 0x770 0x01 [root@5510 ghost]# rdmsr -a 0x770 1 1 1 1 [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# rdmsr -a 0x771 1081a23 1081a23 1081a23 1081a23 [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# rdmsr -a 0x774 8000ff01 8000ff01 8000ff01 8000ff01 [root@5510 ghost]# wrmsr -a 0x774 0x80001a08 [root@5510 ghost]# rdmsr -a 0x774 80001a08 80001a08 80001a08 80001a08 [root@5510 ghost]# [root@5510 ghost]# [root@5510 ghost]# wrmsr -a 0x774 0x80002308 [root@5510 ghost]# rdmsr -a 0x774 80002308 80002308 80002308 80002308 [root@5510 ghost]# > If doesn't freeze then run some workload and again run turbostat with above > options with some load. I was able to run my usual increased load test (compile gcc5) and have attached the output here. 
I did get a hang once I completed the test and attempted to open a tab in the Falkon browser.
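For reference, the 0x771 readout above (1081a23 on every CPU) can be decoded the same way as any other bit-field register. This is a sketch only, assuming 0x771 is IA32_HWP_CAPABILITIES with the layout from the Intel SDM (bits 7:0 highest, 15:8 guaranteed, 23:16 most efficient, 31:24 lowest performance) and assuming a 100 MHz bus clock to turn ratios into frequencies. The function name and the constant are illustrative, not taken from the thread.

```python
BUS_CLOCK_MHZ = 100  # assumption: 100 MHz bus clock, not stated in the thread


def decode_hwp_capabilities(value):
    """Split the low 32 bits of IA32_HWP_CAPABILITIES (MSR 0x771) into ratios."""
    return {
        "highest": value & 0xFF,
        "guaranteed": (value >> 8) & 0xFF,
        "most_efficient": (value >> 16) & 0xFF,
        "lowest": (value >> 24) & 0xFF,
    }


caps = decode_hwp_capabilities(0x1081A23)  # value reported by rdmsr -a 0x771
for name, ratio in caps.items():
    print(f"{name}: ratio {ratio} (~{ratio * BUS_CLOCK_MHZ} MHz)")
```

Read this way, 0x1a would be the guaranteed level and 0x23 the highest level, so the last write to 0x774 (0x80002308) raises the allowed maximum from the guaranteed level to the highest one.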
So you did get a freeze even without using intel_pstate? Basically, letting the platform enter turbo causes some issue in another part of the code. Maybe it is just a UI hang? If you ssh into this platform from another computer, it may still be responding. Just boot Ubuntu or Fedora from a USB disk and check whether you see a freeze. Fedora usually moves to a pretty recent kernel.
(In reply to Srinivas Pandruvada from comment #66) > So you did get a freeze even without using intel_pstate? Basically taking > platform to take turbo causes some issue in other part of the code. May be > it is just UI hang? If you ssh into this platform from other computer it may > be still responding. > > Just boot ubuntu or Fedora from a USB disk and check if you see freeze. > Fedora usually migrates to prettu latest kernel. So I just retested five times in a row, still on the 4.17.3-1-custom built kernel (boot with intel_pstate=no_turbo cpufreq.off=1, load msr module, write msr values as instructed, increase load by compiling gcc5) and got a hard freeze each time. In each instance I can't ssh in (confirmed I could ssh in normally prior to each hang for each of the five boots I tested), and I also can't switch ttys. In some instances I get a flashing caps lock key, in others I don't. I have to hold the power button down for 3-5 seconds to hard power off the machine. I'll boot from latest fedora and/or ubuntu and report back shortly.
Can confirm that booting with latest fedora live usb produces freezes on my machine (four boots and four freezes in a row). Though I've been able to make it through the boot process in each one (the freezes happen after logging into desktop). I was able to run some simple stress tests ( stress -c 4 ) for some time during one of the boots without a hang, but then hung later while just changing directory in terminal: $ uname -a Linux localhost-live 4.16.3-301.fc28.x86_64 #1 SMP Mon Apr 23 21:59:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux intel_pstate was scaling_driver, no_turbo was set to "0", powersave was the scaling_governor. I'll try ubuntu also.
Same with Ubuntu's latest live iso (18.04 LTS), had a couple of freezes during the boot process, also had one when I was able to get booted just trying to cat some files in the terminal. $ uname -a Linux ubuntu 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
This issue is happening irrespective of the intel_pstate driver. Enabling turbo triggers the issue on this platform. Even if HWP is disabled (which these MSRs are enabling), you still see the issue. It also happens with kernels as old as 4.15. You are seeing too many BUG messages from some wireless driver; can you fix that noise? Maybe rmmod the module and see whether it has anything to do with this. Since we can't get logs after the issue occurs, it is difficult to see where it was stuck. Try enabling CONFIG_MAGIC_SYSRQ. Check here and see if you can make the kernel respond after the freeze. https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
(In reply to Srinivas Pandruvada from comment #70) > This issue is happening of irrespective of intel_pstate driver. Enabling > turbo is getting issue on this platform. I agree that the root issue here appears to have nothing to do with the intel_pstate CPU frequency scaling driver, and everything to do with turbo being enabled or disabled, data from attachment 277141 [details] (which maybe didn't run long enough) notwithstanding.
Two issues to note here: - This laptop was reportedly sold with Ubuntu Linux, but I haven't seen any reports from Canonical or others. - Some kernel version used to boot on this machine; can we try that version again? This will tell us whether some HW issue has developed on this platform lately. Also, if we have to bisect, we need a known-good version. The magic SysRq key, if it works, may help in debugging.
(In reply to Srinivas Pandruvada from comment #70) > This issue is happening of irrespective of intel_pstate driver. Enabling > turbo is getting issue on this platform. Even if HWP is disabled (which > these MSRs are enabling), you still see the issue. Also this is happening > for kernel as old as 4.15 kernel. I see, thanks for your time and efforts working through to verify this; at least we know where the problem isn't right now - which is more than we knew before. > You are seeing too many BUG for some wireless driver, can you fix that > noise? May be rmmod the module and see if this has anything to do. I think these bugs are related to a recent switch to iwd from wpa_supplicant and should not be related to my freezing issues as the freezing occurred prior to installing iwd and continued after installing iwd. I can certainly switch off iwd and go back to wpa_supplicant if you think it's worthwhile. > Since we can't get logs after issue occurs, it is difficult to see where it > was stuck. > Try enabling CONFIG_MAGIC_SYSRQ. Check here and see if you can make the > kernel respond after the freeze. > https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html I will work on implementing this here soon and report back with any findings. (In reply to Srinivas Pandruvada from comment #72) > Two issues to note here: > - This laptop was sold with Ubuntu Linux as claimed. But I haven't seen any > reports from Canonical or others. I know this particular build of the Precision 5510 occupied a gray area as most who would buy this machine would want an i7 (or at least an i5 with HT) along with a dedicated graphics card as it's marketed as a Workstation class laptop. I bought this specifically because I didn't want HT (hence the i5-6440HQ) and I didn't want an nvidia dedicated card to struggle with when using Linux. There may not have been that many of this particular build sold, and maybe even less where they're running Linux is my only guess. 
I have seen a small sample of anecdotal reports of freezing/hanging when searching across various fora, but it's hard to get a gauge on how prevalent it is or if it's related to this turbo issue. I do know there's been a lot of posts about the BIOS on this laptop being really buggy and most have reached the conclusion that Dell has decided not to dedicate any resources to addressing which is unfortunate to say the least. > - Some version used to boot on this machine, can we try that version again? > This will help if there is some HW issue on this platform lately. Also if we > have to bisect we need to have some good version. Magic sys key request if > works, then may help in debug. I can certainly try to do this, but this will be a massive undertaking. I can access older versions of the kernel using Arch Archive, but downgrading to say kernel 4.1x and all the other package downgrades that will need to happen along with that is more than a notion. I'll see what this entails though. I may be able to parse the journal to see if I can find clues as to when this started happening and then use that to base which version of the kernel to try.
(In reply to carbonchauvinist from comment #73) > I can certainly try to do this, but this will be a massive undertaking. I > can access older versions of the kernel using Arch Archive, but downgrading > to say kernel 4.1x and all the other package downgrades that will need to > happen along with that is more than a notion. I'll see what this entails > though. You shouldn't have to downgrade any other packages, just try the other kernels. I have kernels from 4.4 through 4.18-rc4 on my test computer right now. Finding a good starting point kernel for a bisection is not hard. The bisection itself is not hard, but very tedious and time consuming.
(In reply to Doug Smythies from comment #71) > (In reply to Srinivas Pandruvada from comment #70) > > This issue is happening of irrespective of intel_pstate driver. Enabling > > turbo is getting issue on this platform. > > I agree that the root issue here appears to have nothing to the intel_pstate > CPU frequency scaling driver, and everything to do with turbo enabled or > disabled, data from attachment 277141 [details] (which maybe didn't run long > enough) not withstanding. My confusion is, because the issue is with Turbo, and I most certainly was hitting Turbo ranges in the past regardless of the kernel version I was on, why would it have only manifested (semi-)recently? Might this have been introduced with a BIOS update? Is there any chance this issue is related to the buggy BIOS for this machine? (In reply to Doug Smythies from comment #74) > You shouldn't have to downgrade any other packages, just try the other > kernels. I have kernels from 4.4 through 4.18-rc4 on my test computer right > now. Finding a good starting point kernel for a bisection is not hard. The > bisection itself is not hard, but very tedious and time consuming. Thanks again for your time and assistance Doug. I will try to see if I can find anything in the journal that clues me into a good kernel version to start testing from and then will proceed as best I can. Essentially, I'll just be looking for the kernel version that allows me to boot regularly (i.e. without intel_pstate=disable) successfully?
(In reply to carbonchauvinist from comment #75) > My confusion is, because the issue is with Turbo, and I most certainly was > hitting Turbo ranges in the past regardless of the kernel version I was on, > why would it have only manifested (semi-)recently? Well, that is what we're trying to figure out. > Might this have been introduced with a BIOS update? I don't know. > Is there any chance this > issue is related to the buggy BIOS for this machine? I don't know. > Essentially, I'll just be looking for the kernel version that allows me to > boot regularly (i.e. without intel_pstate=disable) successfully? Whatever is your most reliable / probable indicator of the issue. If you want to continue with a kernel bisection, that could help. However, often (it seems) once one finishes a bisection, then uses a search engine with the commit number, then one finds information about the issue (from my own experiences, about 3/4 of the time). Prior to the bisection, often the exact correct search terms are not known. Another option is for you to simply disable turbo in your BIOS, and move on, checking new kernels every few months to see if the issue persists. i.e. leave it to somebody else to solve.
Created attachment 277327 [details] 4.18.13.-1 freeze on boot Well, I went all the way back to kernel 4.8.13-1, which is from around the time I first got the machine, and I'm still having freezes on boot and once booted. I went through the steps of enabling the magic SysRq key. The last couple of times I got a freeze I tried pressing the REISUB key sequence and got some output. Not sure if it's of use; I couldn't transcribe or otherwise capture it, so I took a picture.
So something went wrong in the system: you are getting the issue on an old kernel too. From the attachment it seems that there was a machine check error. These are errors the CPU raises when there is some HW error (some are recoverable, some are not). This type of freeze will happen if an unrecoverable error occurred. I am not an expert on machine checks, so there may be a code identifying the actual error.
I was never able to recreate the machine check error I received when booting kernel 4.8.13-1; it happened only once and did not happen again afterwards. I tried numerous other kernel versions moving up from 4.8.13-1 and still got the system freezes, but no more of the machine check errors. I've taken Doug's advice and have simply turned off turbo in my BIOS - which has worked well enough for me at this point (though it leaves a really bad taste in my mouth considering the amount of money spent on this laptop). There was a new BIOS released for this laptop recently (1.8.0 released 8/14/18) which I applied hoping this would help address my issue. However, after applying the new BIOS successfully, and turning Turbo back on in the BIOS, I booted into the current stable ARCH kernel (4.17.14.arch1-1) and when running an increased load test (compiling gcc5) I received a hang almost immediately upon starting the compile. I'll try reaching out to Dell Sputnik support to see if I get any recourse there. Otherwise, I'll continue to grit my teeth and run with turbo disabled in the BIOS.
@carbonchauvinist@protonmail.ch Do you still see this issue using the latest upstream kernel after one year? If yes, I'd suggest running a benchmark with turbo enabled, with only one CPU online (maxcpus=1), and with graphics disabled (modprobe.blacklist=i915). Or, to be even more aggressive, you can rebuild the kernel with CONFIG_DRM not set. Then run stress-ng to check whether the issue is still there.
@Chen Yu (In reply to Chen Yu from comment #80) > Do you still see this issue using the latest upstream kernel after one year? Thank you for your interest. Yes, unfortunately, even now running the latest 5.3.13-arch1-1 kernel, the issue is still present. At this point I'm unable to boot at all if I have turbo enabled (hangs on vendor logo, no flicker to indicate that it ever changed out of the initrd or that early KMS has occurred). If I disable pstate (intel_pstate=disable) I can boot with turbo enabled -- when this happens, once I'm logged into my DE it appears to operate normally. However, I still experience complete freezes intermittently - especially when coming back from a suspend for instance. I just worked around all this by disabling turbo in the UEFI. > If yes, I'd suggested to test benchmark with turbo enabled, and with only > CPU online(maxcpus=1), and with graphic disabled(modprobe.blacklist=i915). Booting with maxcpus=1 and modprobe.blacklist=i915, I was able to switch to a TTY, log in successfully and run stress-ng for 90 mins straight without issue! Are there specific flags you want me to run stress-ng with? When I change maxcpus to any number other than 1, I am unable to boot again. Even when booting with modprobe.blacklist=i915 it does appear that i915 was loaded after all though? It shows up in lsmod when run in the same TTY I ran the stress-ng test in. Is the blacklist parameter supposed to stop it from being loaded in the initrd only? Or am I missing something? Also, when booting with maxcpus=1 and modprobe.blacklist=i915 I'm unable to log in using my DM (SDDM), as after entering the password I get a hang (though it's not a hard hang, as I can press the power button and the system shuts down; I can't, however, switch to another TTY when this happens). I'm not sure if that's due to any of the parameters being passed - maybe you expect that to happen? > Or even to be more aggressive, you can rebuild the kernel with CONFIG_DRM > not set. 
> Then running stress-ng to check if the issue is still there. Do you still think this is a valuable test based on the information above? I'll try to give this a go and report back.
@Chen Yu Okay, so I've been able to test a little more. Here's more complete, and hopefully more useful, information: 1. When booting with turbo enabled, maxcpus=1, modprobe.blacklist=i915 and NOT disabling pstate driver -- I can successfully switch to a TTY log in and run stress-ng for long periods of time without issue (at least 90 mins). I am unable to log into a session via SDDM though as it hangs -- doesn't appear to be a full panic or anything as I'm able to power off the computer by simply pressing the power button and I can see the shutdown happen. When the session is hung after trying to log on via SDDM, I am unable to switch to other TTYs. 2. When booting with turbo enabled, maxcpus=2 through maxcpus=3, modprobe.blacklist=i915, and NOT disabling pstate driver -- I can successfully switch to a TTY log in and run stress-ng for long periods of time without issue. (I've run for an hour without issue with maxcpus=2 and maxcpus=3 respectively.) With maxcpus=2 through maxcpus=3 I am able to successfully log into a session via SDDM and I was able to suspend and resume without a hang/freeze (though I only tested the suspend cycle once with maxcpus=3). 3. When booting with turbo enabled, maxcpus=4, modprobe.blacklist=i915, and NOT disabling pstate driver -- I am unable to boot at all. I can't even get past the vendor logo as it hangs here. I don't even see a flicker, just the logo and then it immediately freezes.
(In reply to carbonchauvinist from comment #83) > @Chen Yu > > Okay, so I've been able to test a little more. Here's more complete, and > hopefully more useful, information: > > 1. When booting with turbo enabled, maxcpus=1, modprobe.blacklist=i915 and > NOT disabling pstate driver -- I can successfully switch to a TTY log in and > run stress-ng for long periods of time without issue (at least 90 mins). > > I am unable to log into a session via SDDM though as it hangs -- doesn't > appear to be a full panic or anything as I'm able to power off the computer > by simply pressing the power button and I can see the shutdown happen. When > the session is hung after trying to log on via SDDM, I am unable to switch > to other TTYs. > > 2. When booting with turbo enabled, maxcpus=2 through maxcpus=3, > modprobe.blacklist=i915, and NOT disabling pstate driver -- I can > successfully switch to a TTY log in and run stress-ng for long periods of > time without issue. (I've run for an hour without issue with maxcpus=2 and > maxcpus=3 respectively.) > > With maxcpus=2 through maxcpus=3 I am able to successfully log into a > session via SDDM and I was able to suspend and resume without a hang/freeze > (though I only tested the suspend cycle once with maxcpus=3). > > 3. When booting with turbo enabled, maxcpus=4, modprobe.blacklist=i915, and > NOT disabling pstate driver -- I am unable to boot at all. I can't even get > past the vendor logo as it hangs here. I don't even see a flicker, just the > logo and then it immediately freezes. Thanks very much for the information in detail. This appears to be a conflict between the graphic and high frequency. 
Could you boot with intel_pstate=disable and turbo enabled in the BIOS, and after booting into the system, enable turbo manually as follows: # rdmsr -a 0x1a0 # 850089 Check whether bit 38 (turbo disabled) is set: # rdmsr 0x1a0 -f 38:38 # 0 If bit 38 is 1, turbo is disabled (it should be the case in your case); then enable turbo by clearing bit 38 with wrmsr -a 0x1a0 xxx (where xxx is the value you read via rdmsr with bit 38 cleared).
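The bit test described above can be sketched in Python for anyone who prefers to compute the value before writing it. Bit 38 of MSR 0x1a0 as the turbo-disable flag is taken from the comment above; the helper names are illustrative only, and the result would still have to be written back with wrmsr.

```python
TURBO_DISABLE_BIT = 38  # bit 38 of MSR 0x1a0, per the comment above


def turbo_disabled(misc_enable):
    """Return True if the turbo-disable bit is set in the MSR value."""
    return bool((misc_enable >> TURBO_DISABLE_BIT) & 1)


def with_turbo_enabled(misc_enable):
    """Return the MSR value with the turbo-disable bit cleared."""
    return misc_enable & ~(1 << TURBO_DISABLE_BIT)


value = 0x850089  # example readout quoted in the comment above
print("turbo disabled:", turbo_disabled(value))
print("value to write:", hex(with_turbo_enabled(value)))
```

For the example readout 0x850089 the bit is already clear, so the value to write is unchanged.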
(In reply to Chen Yu from comment #84) > Thanks very much for the information in detail. This appears to be a > conflict between the graphic and high frequency. Could you boot with > intel_pstate=disabled with turbo enabled in BIOS, and after boot into the > system, enable the turbo manually by: > > # rdmsr -a 0x1a0 > # 850089 > check if bit38(turbo disabled) is set > # rdmsr 0x1a0 -f 38:38 > # 0 > if the bit38 is 1, it means the turbo is disabled(it should be the case in > your case), then enable the turbo by setting bit38 to 0 using > wrmsr -a 0x1a0 xxx(xxx should be the value you read via rdmsr with bit38 > cleared) Thank you. Just to be clear and I understand, you are recommending the following? 1. Enable turbo in BIOS 2. Boot with intel_pstate=disable When doing so here is the msr value as read: $ sudo rdmsr -a -x -0 0x1a0 0000000000850089 0000000000850089 0000000000850089 0000000000850089 Why do you think that turbo should be disabled in my case after steps 1 and 2 above? Are you assuming that I'm also disabling turbo via a kernel line boot parameter? Should I be? Here's the current cmdline I'm passing $ cat /proc/cmdline initrd=\intel-ucode.img initrd=\initramfs-dracut-linux.img root=UUID=4d59661c-0a7e-41ce-8d7b-6cdb367f6e91 rw sysrq_always_enabled=1 quiet intremap=off intel_pstate=disable And here's the stats when booted showing turbo is in fact enabled as I would expect (unless I'm missing something): $ type pstate && pstate pstate is aliased to `grep . 
/sys/devices/system/cpu/cpu*/cpufreq/scaling_{driver,governor} /sys/devices/system/cpu/intel_pstate/no_turbo /sys/devices/system/cpu/cpufreq/boost' /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:acpi-cpufreq /sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:acpi-cpufreq /sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:acpi-cpufreq /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:acpi-cpufreq /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:schedutil /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:schedutil /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor:schedutil /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:schedutil grep: /sys/devices/system/cpu/intel_pstate/no_turbo: No such file or directory /sys/devices/system/cpu/cpufreq/boost:1 As a possible aside, I have since moved to dracut initramfs -- while I was using mkinitcpio when originally reporting the issue. Also please note in the past I was able to boot successfully when disabling pstate, but I would get hangs at times when trying to resume from suspends. According to Doug and Srinivas it was most likely a thermal event during those hangs.
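The pstate alias above greps a handful of sysfs files and prints an error for the one that does not exist under acpi-cpufreq. A rough Python equivalent that silently skips missing files is sketched below. The function name and the root-path parameter are made up for illustration; the parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path

# Same files the alias reads, relative to /sys/devices/system/cpu.
PATTERNS = (
    "cpu[0-9]*/cpufreq/scaling_driver",
    "cpu[0-9]*/cpufreq/scaling_governor",
    "intel_pstate/no_turbo",
    "cpufreq/boost",
)


def read_cpufreq_state(sysfs_cpu="/sys/devices/system/cpu"):
    """Return {relative path: contents} for every listed file that exists."""
    root = Path(sysfs_cpu)
    state = {}
    for pattern in PATTERNS:
        for path in sorted(root.glob(pattern)):
            try:
                state[str(path.relative_to(root))] = path.read_text().strip()
            except OSError:
                continue  # unreadable file: skip it rather than abort
    return state


if __name__ == "__main__":
    for name, value in read_cpufreq_state().items():
        print(f"{name}:{value}")
```

With acpi-cpufreq loaded, intel_pstate/no_turbo simply does not appear in the result, and cpufreq/boost is the file that reports whether turbo is allowed.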
(In reply to carbonchauvinist from comment #85)
> (In reply to Chen Yu from comment #84)
> > Thanks very much for the information in detail. This appears to be a
> > conflict between the graphics and high frequency. Could you boot with
> > intel_pstate=disable with turbo enabled in BIOS, and after boot into the
> > system, enable the turbo manually by:
> >
> > # rdmsr -a 0x1a0
> > # 850089
> > check if bit38(turbo disabled) is set
> > # rdmsr 0x1a0 -f 38:38
> > # 0
> > if the bit38 is 1, it means the turbo is disabled(it should be the case in
> > your case), then enable the turbo by setting bit38 to 0 using
> > wrmsr -a 0x1a0 xxx(xxx should be the value you read via rdmsr with bit38
> > cleared)
>
> Thank you. Just to be clear and I understand, you are recommending the
> following?
> 1. Enable turbo in BIOS
> 2. Boot with intel_pstate=disable
>
> When doing so here is the msr value as read:
>
> $ sudo rdmsr -a -x -0 0x1a0
> 0000000000850089
> 0000000000850089
> 0000000000850089
> 0000000000850089
>
> Why do you think that turbo should be disabled in my case after steps 1 and
> 2 above? Are you assuming that I'm also disabling turbo via a kernel line
> boot parameter? Should I be?
>

It was a typo. I intended to check the impact of turbo, but after reading the thread again, it seems we have already tested that. So I'm trying to figure out what the result would be if we disable graphics. As you mentioned, i915 was loaded anyway. Could you check what i915's dependencies are after the system has booted? You can check with the lsmod command; there should be some dependencies for i915, and we might need to keep them from loading via the boot command line, for example:

modprobe.blacklist=i915,cec,video,i2c_algo_bit,drm_kms_helper,drm

If i915 is still loaded, the last resort would be to disable CONFIG_DRM in the kernel config and recompile the kernel.

BTW, besides the i915 driver, do you have any other graphics drivers loaded on your system? We should unload them as well.
Thanks for the clarification. I'm willing to try completely disabling graphics.

$ lsmod | rg i915
i915 2617344 23
intel_gtt 24576 1 i915
i2c_algo_bit 16384 1 i915
drm_kms_helper 253952 1 i915
cec 69632 2 drm_kms_helper,i915
drm 581632 8 drm_kms_helper,i915

So I can try disabling {i915,intel_gtt,i2c_algo_bit,drm_kms_helper,cec,drm} IIUC.

I'm only using the i915 driver; this is an onboard-video-only system with no other graphics.

An additional data point: I booted with intel_pstate disabled just now and was able to use the system for some time (~2 hrs), including a suspend and resume cycle and multiple short runs of stress-ng (~1 min duration x 4 or 5 times).

However, not long after the resume (~5 mins), I got a freeze while browsing a random website.

Booting back up after that freeze, it took two attempts for gdm to load (the first attempt froze before gdm loaded). When gdm did load on the second boot, it froze immediately. According to the journal it was:

Jul 13 20:27:55 lap kernel: #PF: supervisor read access in kernel mode
Jul 13 20:27:55 lap kernel: BUG: kernel NULL pointer dereference, address: 0000000000000070

I'll attach that journal here next in case it's helpful.
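The module list in the lsmod output above can be turned into a modprobe.blacklist= parameter mechanically. This is a rough sketch, assuming standard lsmod column layout (name, size, use count, "Used by" list); the helper name blacklist_param is made up for illustration:

```shell
# Hypothetical helper: read lsmod-formatted text on stdin and print a
# modprobe.blacklist= parameter naming the target module plus every module
# whose "Used by" column lists it (i.e. the modules the target depends on).
blacklist_param() {
    awk -v target="$1" '
        $1 == "Module" { next }                  # skip the lsmod header row
        $1 == target { mods[++n] = $1; next }    # the target module itself
        {
            k = split($4, users, ",")            # 4th column: users of this module
            for (i = 1; i <= k; i++)
                if (users[i] == target) { mods[++n] = $1; break }
        }
        END {
            for (i = 1; i <= n; i++) {
                if (i > 1) out = out ","
                out = out mods[i]
            }
            if (n > 0) print "modprobe.blacklist=" out
        }'
}

# On a live system: lsmod | blacklist_param i915
# Here, fed with the rows posted in this comment:
blacklist_param i915 <<'EOF'
i915 2617344 23
intel_gtt 24576 1 i915
i2c_algo_bit 16384 1 i915
drm_kms_helper 253952 1 i915
cec 69632 2 drm_kms_helper,i915
drm 581632 8 drm_kms_helper,i915
EOF
```

With the rows from this comment it prints modprobe.blacklist=i915,intel_gtt,i2c_algo_bit,drm_kms_helper,cec,drm, the same set listed above. Note that a later comment in this thread reports that blacklisting alone did not keep i915 from loading.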
Created attachment 290259 [details] intel_pstate=disable kernel 5.7.8-arch1-1 kernel null pointer dereference #pf supervisor read access
(In reply to carbonchauvinist from comment #87)
> Thanks for the clarification. I'm willing to try completely disabling graphics.
>
> $ lsmod | rg i915
> i915 2617344 23
> intel_gtt 24576 1 i915
> i2c_algo_bit 16384 1 i915
> drm_kms_helper 253952 1 i915
> cec 69632 2 drm_kms_helper,i915
> drm 581632 8 drm_kms_helper,i915
>
> So I can try disabling {i915,intel_gtt,i2c_algo_bit,drm_kms_helper,cec,drm}
> IIUC.
>
> I'm only using i915 driver, this is an onboard video only system no other
> graphics.
>
> An additional data point, I booted in with intel_pstate disabled just now
> and was able to use the system for some time (~2 hrs) including a suspend
> and resume cycle and multiple short runs of stress-ng (~1 min duration x 4
> or 5 times).
>

Please check whether boot succeeds with intel_pstate enabled (turbo enabled) + graphics disabled, to first narrow down whether there is a graphics conflict.

> However, not too long after the resume (~5 mins) when browsing a random
> website received a freeze.

The suspend/resume issue might be a separate problem; let's stick to the boot issue first.

>
> Booting back up after that freeze, took me twice for gdm to load (first time
> froze before gdm loaded). When gdm did load on the second boot it froze
> immediately - according to journal it was:
>
> Jul 13 20:27:55 lap kernel: #PF: supervisor read access in kernel mode
>
> Jul 13 20:27:55 lap kernel: BUG: kernel NULL pointer dereference, address:
> 0000000000000070
>
> I'll attach that journal here next in case it's helpful.

This seems to be another boot issue. BTW, is it possible to use the latest upstream kernel (compiled from source rather than using the Arch Linux image directly)?
(In reply to Chen Yu from comment #89)
> Please check whether boot succeeds with intel_pstate enabled (turbo enabled) +
> graphics disabled, to first narrow down whether there is a graphics conflict.

I will try this and report back.

> The suspend/resume issue might be a separate problem; let's stick to the boot
> issue first.

I agree.

> BTW, is it possible to use the latest
> upstream kernel (compiled from source rather than using the Arch
> Linux image directly)?

Yes, I can try to do this.
Created attachment 290291 [details] intel_pstate enabled; turbo enabled; graphic disabled; still kernel panic

(In reply to Chen Yu from comment #89)
> Please check whether boot succeeds with intel_pstate enabled (turbo enabled) +
> graphics disabled, to first narrow down whether there is a graphics conflict.

I was able to completely prevent the i915 module from loading. To do so I needed more than just the kernel command line parameter of `modprobe_blacklist=i915,cec....`

I also had to add the following to /etc/modprobe.d/modprobe.conf and make sure it was included in my dracut initramfs:

install i915 /bin/true
install intel_gtt /bin/true

Simply blacklisting with

blacklist i915
blacklist intel_gtt

in the modprobe.conf file did NOT stop the module from loading.

Even with the i915 module NOT loaded and intel_pstate enabled, I would only successfully get to a login prompt maybe 1 out of 10 times. The other 9 out of 10 times (approx.) it would freeze somewhere in the initramfs, before it got to the initramfs (i.e. ACPI errors were shown and then a freeze), or immediately at the Dell vendor logo (i.e. no ACPI errors were printed to screen, just the logo and an immediate freeze).

If I did get the login prompt and logged in, I would still get freezes, though the timing varied from just a few seconds to up to 10 minutes before a freeze.

This picture captures one of those freezes (sorry, no other way to capture the output of the freeze). As you can see, i915 is not loaded (`lsmod | rg i915` returns nothing). I then cat the modprobe.conf file to show how I blacklisted the modules. Then just trying to run a random command, in this case `fwupdmgr get-updates`, caused a kernel panic after spitting out some ACPI errors.

This is with the Arch kernel, though. Do you still think it is useful to try the latest upstream kernel as well? (Do you mean the latest STABLE upstream, i.e. 5.7.8? Or the latest upstream mainline, i.e. 5.8-rc5? And would this just be the default config when compiling?)

Thanks again for your time on this. It's been really frustrating having to hobble this machine by disabling turbo. Disabling turbo in the BIOS, though, makes the machine run like a dream with no hangs, freezes, or panics.
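For context on the difference reported in this comment: in modprobe.d syntax, `blacklist` only makes modprobe ignore a module's internal aliases, so the module can still be loaded by name or as a dependency, whereas `install <module> <command>` makes modprobe run the given command instead of loading the module. The lines used above can be generated for any set of modules with a small sketch like the following; the helper name is made up for illustration, and the output still has to be written to a file under /etc/modprobe.d/ and included in the initramfs by hand:

```shell
# Hypothetical helper: print one "install <module> /bin/true" line per
# argument, in the form used in /etc/modprobe.d/*.conf files.
modprobe_disable_lines() {
    for mod in "$@"; do
        printf 'install %s /bin/true\n' "$mod"
    done
}

modprobe_disable_lines i915 intel_gtt
```

Called as shown, this prints the two lines quoted in this comment.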
I think he means kernel 5.8-rc5 (well, -rc6 now). Use the kernel configuration from your 5.7.8-arch1-1 kernel, and just accept the defaults from there, if required.

Since I thought we had eliminated the intel_pstate CPU frequency scaling driver as an issue for this bug report, I don't really pay attention anymore. While I cannot speak for Srinivas, that might be the case for him also.
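Reusing the running kernel's configuration and accepting defaults for anything new could look roughly like the sketch below. It assumes the running kernel exposes its configuration at /proc/config.gz (this requires CONFIG_IKCONFIG_PROC) and that a kernel source tree is already unpacked; the function name and the example path are made up for illustration. `make olddefconfig` is the kernel build target that keeps existing answers and sets new options to their default values:

```shell
# Hypothetical helper: seed a kernel source tree with an existing config and
# fill in defaults for options that are new in that tree.
#   $1 = path to the kernel source tree
#   $2 = gzip-compressed config to start from (default: /proc/config.gz)
prepare_config() {
    src=$1
    cfg=${2:-/proc/config.gz}
    [ -d "$src" ] || { echo "no kernel tree at $src" >&2; return 1; }
    [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 1; }
    zcat "$cfg" > "$src/.config" || return 1
    make -C "$src" olddefconfig
}

# Example (path is a placeholder):
#   prepare_config "$HOME/src/linux"
```

The function only prepares .config; building and installing the kernel are separate steps.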
This is a really long thread, and it has been a long time since the last update.

Just want to confirm whether my understanding of this problem is correct:
1. The problem cannot be reproduced with turbo disabled in BIOS.
2. The problem cannot be reproduced with turbo disabled via the kernel command line, using Srinivas' patch.
3. The problem cannot be reproduced with the powersave governor + passive mode.

All of this suggests that there is a race/thermal issue during boot that causes the failure.

I'm not sure how likely a race is, because if it were one, it should also be reproducible on other platforms. But Srinivas has the same model of laptop and could not reproduce the problem.

Thus I suspect this might be a unit failure on this machine only.

carbonchauvinist@protonmail.ch

Can you please try the latest upstream and see whether the symptom changes, and can you please remove kernel command-line parameters like "quiet", "splash", etc., and then recapture the kernel panic screenshot? I'm not sure whether this kernel panic is related to this problem or not.
Created attachment 296951 [details] 5.13.0-rc2-1-mainline - mce when booting with turbo enabled

(In reply to Zhang Rui from comment #93)
> This is a really long thread, and it has been a long time since the last update.
>
> Just want to confirm whether my understanding of this problem is correct:
> 1. The problem cannot be reproduced with turbo disabled in BIOS.

Yes, that is correct.

> 2. The problem cannot be reproduced with turbo disabled via the kernel command
> line, using Srinivas' patch.

Yes, that is also correct.

> 3. The problem cannot be reproduced with the powersave governor + passive mode.

I was unable to reproduce the issue with the powersave governor + passive mode.

> All of this suggests that there is a race/thermal issue during boot that
> causes the failure.
>
> I'm not sure how likely a race is, because if it were one, it should
> also be reproducible on other platforms. But Srinivas has the same
> model of laptop and could not reproduce the problem.
>
> Thus I suspect this might be a unit failure on this machine only.

That honestly makes sense, especially if Srinivas has the exact same model and is unable to reproduce.

> carbonchauvinist@protonmail.ch
>
> Can you please try the latest upstream and see whether the symptom changes,
> and can you please remove kernel command-line parameters like "quiet",
> "splash", etc., and then recapture the kernel panic screenshot?
> I'm not sure whether this kernel panic is related to this problem or not.

I installed the latest upstream as of today (5.13.0-rc2-1-mainline) and booted without the quiet kernel parameter and with turbo re-enabled in the BIOS. Sure enough, the first three boots froze. On the fourth boot I was able to make it to gdm and log in to the GNOME desktop.

Here's the picture captured, showing the errors "mce: CPUs not responding to MCE broadcast (may include false positives): 0"

Interestingly, perhaps, that's only listed three times though I have a four-core processor. I do remember that if I booted with only three cores enabled, I was able to boot with turbo enabled. So perhaps this is further evidence of a unit failure on this machine only (affecting one core?).

Let me know your thoughts; otherwise I'm willing to close this issue out (with tremendous thanks to all who assisted in troubleshooting for their valuable time and expertise).
(In reply to carbonchauvinist from comment #94)
> >
> > I'm not sure how likely a race is, because if it were one, it should
> > also be reproducible on other platforms. But Srinivas has the same
> > model of laptop and could not reproduce the problem.
> >
> > Thus I suspect this might be a unit failure on this machine only.
>
> That honestly makes sense, especially if Srinivas has the exact same model
> and is unable to reproduce.
>
> > carbonchauvinist@protonmail.ch
> >
> > Can you please try the latest upstream and see whether the symptom changes,
> > and can you please remove kernel command-line parameters like "quiet",
> > "splash", etc., and then recapture the kernel panic screenshot?
> > I'm not sure whether this kernel panic is related to this problem or not.
>
> I installed the latest upstream as of today (5.13.0-rc2-1-mainline) and booted
> without the quiet kernel parameter and with turbo re-enabled in the BIOS. Sure
> enough, the first three boots froze. On the fourth boot I was able to make it
> to gdm and log in to the GNOME desktop.
>
> Here's the picture captured, showing the errors "mce: CPUs not responding to
> MCE broadcast (may include false positives): 0"
>
> Interestingly, perhaps, that's only listed three times though I have a
> four-core processor. I do remember that if I booted with only three cores
> enabled, I was able to boot with turbo enabled. So perhaps this is further
> evidence of a unit failure on this machine only (affecting one core?).
>
> Let me know your thoughts; otherwise I'm willing to close this issue out
> (with tremendous thanks to all who assisted in troubleshooting for their
> valuable time and expertise).

Thanks for the update, carbonchauvinist.
I agree we can close this bug report. I don't have any further thoughts on this problem from the OS's point of view.