Bug 188461 - intel_idle.max_cstate=1 required to prevent boot failure - Haswell - Intel(R) Core(TM) i7-4790K CPU
Summary: intel_idle.max_cstate=1 required to prevent boot failure - Haswell - Intel(R) ...
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_idle
Hardware: Intel Linux
Importance: P1 normal
Assignee: Len Brown
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-24 06:41 UTC by brendan_os
Modified: 2018-09-27 07:06 UTC
CC List: 4 users

See Also:
Kernel Version: 4.8.10-1.gd1ec066-default
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Dmesg for intel_idle.max_cstate=1 (15.60 KB, application/gzip)
2017-03-13 21:08 UTC, brendan_os

Description brendan_os 2016-11-24 06:41:40 UTC
This is an upstream resubmit of this bug report for opensuse:
https://bugzilla.opensuse.org/show_bug.cgi?id=1011254

My computer will not boot unless acpi=off is set in boot parameters for OpenSuSE Leap 42.2 (I have tried acpi=ht, acpi=off and acpi=noirq).

I have also tried booting these other kernels (they also fail to boot):

kernel-default-4.1.35-12.1.g2e75991.x86_64.rpm  and
kernel-default-4.8.10-1.1.gd1ec066.x86_64.rpm

The system will attempt to load, then reset (screen blanks, computer power cycles). The reset always occurs, but doesn't always occur at the same "time". It can be as little as a second or two from loading initrd up to about 6 seconds (based on readouts I can glimpse with quiet off). 

I have updated my motherboard with the most recent bios (E7918IMS.2A0).

This other bug may, or may not be relevant:
https://bugzilla.opensuse.org/show_bug.cgi?id=990003

The openSUSE people suggested lodging this bug upstream.

Thanks
Comment 1 brendan_os 2016-11-25 12:31:14 UTC
I have done many, many tests with kernels from system rescue cds. 

None of the kernels I tried would boot and stay stable without extra parameters. By stable I mean that the system did not reset/reboot automatically[1]. More recent kernels would boot cleanly (i.e. with no parameters) but would reset on a right mouse click/menu operation.

However, the following kernels would all boot and be stable with nohz=off parameter:
3.10.25
3.10.32
3.10.55
3.12.7

The following needed acpi=off to boot and be stable. That is, nohz=off was not enough to ensure a booted, stable system:
3.13.5
3.14.20
3.18.34
4.1.27


In one case (3.13.5), nohz=off gave a _stable system on alternate reboots_: first boot unstable, reboot stable, reboot unstable, etc. (rebooted 7 times).

Something seems to have happened between 3.12.7 and 3.13.5. Up to 3.12.7 nohz=off was enough to give me a stable system. From 3.13.5, I needed to use acpi=off instead to get a stable system.
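
For reference, the boundary described above can be read straight off the test results. The following is only a minimal sketch: the table restates the kernels and parameters listed in this comment, the helper names (`version_key`, `find_boundary`) are made up for illustration, and it assumes the behaviour changed only once across the version range.

```python
# Parameter each tested kernel needed in order to boot and stay stable,
# as listed in this comment.
results = {
    "3.10.25": "nohz=off",
    "3.10.32": "nohz=off",
    "3.10.55": "nohz=off",
    "3.12.7": "nohz=off",
    "3.13.5": "acpi=off",
    "3.14.20": "acpi=off",
    "3.18.34": "acpi=off",
    "4.1.27": "acpi=off",
}


def version_key(version):
    """Turn '3.12.7' into (3, 12, 7) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def find_boundary(results, good="nohz=off"):
    """Return (last kernel where `good` sufficed, first kernel where it did not)."""
    last_good = None
    for version in sorted(results, key=version_key):
        if results[version] == good:
            last_good = version
        else:
            return last_good, version
    return last_good, None


boundary = find_boundary(results)
print(boundary)
```

The two versions returned only bracket the range in which the behaviour changed; with so few data points they do not identify a specific commit.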

Note:
1: To test stability I typed uname -r into the console, ran the internet browser, right-clicked to get a menu, and opened the applications menu. Typically the system would reboot within seconds of either a console or X starting (the USB stick boots to a command line; you then run startx). If the system made it to X, right-clicking the mouse seemed to trigger a reset.

Could someone take a look at this please?
Comment 2 Zhang Rui 2016-11-28 08:53:40 UTC
does the problem still exist if you use boot option idle=poll?
Comment 3 brendan_os 2016-11-28 11:40:25 UTC
If I use idle=poll the system will boot and seems to be stable (40 minutes and counting). 

The system fan runs noticeably louder, although system load seems small.

Tested:

Leap 42.2 (kernel 4.8.10-1.gd1ec066-default) (3 times - i.e. booted, seemed stable, tried again anyway)
SystemRescueCd 4.8.0 (kernel 4.4.28) (3 times)
SystemRescueCd 4.0.0 (kernel 3.10.25)
Comment 4 Zhang Rui 2016-11-29 08:35:55 UTC
then what about idle=nomwait?
Comment 5 Zhang Rui 2016-11-29 08:38:36 UTC
and what about intel_idle.max_cstate=1
Comment 6 brendan_os 2016-11-29 11:11:13 UTC
idle=nomwait:
will not boot (fails after a few seconds, well before X loads)

intel_idle.max_cstate=1:
boots, seems stable. 

(kernel: 4.8.10)
Comment 7 Zhang Rui 2016-11-30 02:02:01 UTC
then what about intel_idle.max_cstate=2 or 3?
Comment 8 brendan_os 2016-11-30 02:26:36 UTC
3: fails to boot

2: seems to be stable
Comment 9 brendan_os 2016-12-21 07:13:52 UTC
Where can I find out whether it's better to boot with intel_idle.max_cstate=2 vs acpi=off?
Comment 10 Zhang Rui 2016-12-22 00:37:57 UTC
it's better to use intel_idle.max_cstate=2, which has ACPI enabled.
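
When comparing these options it can help to confirm which ones the running kernel was actually booted with. Below is a small sketch, not an official tool: it filters the kernel command line for the parameters discussed in this report. The helper names are hypothetical, and the sysfs path for the effective intel_idle limit is an assumption about the running system, so the script simply reports None if the file is missing.

```python
from pathlib import Path

# Boot parameters discussed in this bug report.
IDLE_PARAMS = ("intel_idle.max_cstate", "idle", "nohz", "acpi")


def parse_cmdline(cmdline):
    """Return the idle-related key=value parameters found on a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in IDLE_PARAMS:
            found[key] = value
    return found


def read_text_or_none(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline")
    if cmdline is not None:
        print("boot parameters:", parse_cmdline(cmdline))
    # Assumed location of the module parameter; absent if intel_idle is not in use.
    limit = read_text_or_none("/sys/module/intel_idle/parameters/max_cstate")
    print("intel_idle max_cstate:", limit)
```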
Comment 11 brendan_os 2017-02-02 23:40:07 UTC
with intel_idle.max_cstate=2 I am getting random reboots.
They occur infrequently (once a week to once a fortnight). I thought they might be related to my corsair mouse, but they occur without the mouse present.
Should I open a new bug report?
Comment 12 Zhang Rui 2017-03-13 05:54:47 UTC
please attach the dmesg output when booting with intel_idle.max_cstate=1
Comment 13 brendan_os 2017-03-13 21:08:29 UTC
Created attachment 255223
Dmesg for intel_idle.max_cstate=1
Comment 14 brendan_os 2017-03-13 21:08:56 UTC
Attached
Comment 15 brendan_os 2017-03-13 21:10:14 UTC
Actually, the reboots are more frequent. Might be using the computer more. However, probably once every day or two. Sometimes more than once a day.
Comment 16 Zhang Rui 2017-03-14 07:16:34 UTC
(In reply to brendan_os from comment #15)
> Actually, the reboots are more frequent. Might be using the computer more.
> However, probably once every day or two. Sometimes more than once a day.

what do you mean?
I thought intel_idle.max_cstate=1 would be sufficient to stop the reboot issue, no?
Comment 17 brendan_os 2017-03-14 08:24:44 UTC
No, when max_cstate=2 I'm getting random reboots. The screen goes black without warning and the machine restarts - see comment 11.
Comment 18 Zhang Rui 2017-03-27 04:05:38 UTC
(In reply to brendan_os from comment #17)
> No, when max_cstate=2 I'm getting random reboots. The screen goes black
> without warning and the machine restarts - see comment 11.

and with intel_idle.max_cstate=1, the problem never occurs, right?
Comment 19 brendan_os 2017-03-27 07:28:52 UTC
Yes. I have now edited the boot defaults to use intel_idle.max_cstate=1. If I get reboots I will post.
Comment 20 Zhang Rui 2017-06-17 07:31:21 UTC
[    0.067784] smpboot: CPU0: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz (family: 0x6, model: 0x3c, stepping: 0x3)

I guess this is a Haswell processor, right?
please attach the lspci output to confirm.
Comment 21 brendan_os 2017-06-17 11:38:41 UTC
>/sbin/lspci
00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
00:14.0 USB controller: Intel Corporation 9 Series Chipset Family USB xHCI Controller
00:16.0 Communication controller: Intel Corporation 9 Series Chipset Family ME Interface #1
00:1a.0 USB controller: Intel Corporation 9 Series Chipset Family USB EHCI Controller #2
00:1b.0 Audio device: Intel Corporation 9 Series Chipset Family HD Audio Controller
00:1c.0 PCI bridge: Intel Corporation 9 Series Chipset Family PCI Express Root Port 1 (rev d0)
00:1c.2 PCI bridge: Intel Corporation 9 Series Chipset Family PCI Express Root Port 3 (rev d0)
00:1c.3 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d0)
00:1d.0 USB controller: Intel Corporation 9 Series Chipset Family USB EHCI Controller #1
00:1f.0 ISA bridge: Intel Corporation 9 Series Chipset Family Z97 LPC Controller
00:1f.2 SATA controller: Intel Corporation 9 Series Chipset Family SATA Controller [AHCI Mode]
00:1f.3 SMBus: Intel Corporation 9 Series Chipset Family SMBus Controller
02:00.0 Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)
03:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03)
Comment 22 Zhang Rui 2017-06-19 03:13:38 UTC
I didn't get any useful information from the lspci output, but anyway, according to

[    0.067784] smpboot: CPU0: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz (family: 0x6, model: 0x3c, stepping: 0x3)
and
#define INTEL_FAM6_HASWELL_CORE         0x3C

this should be a HSW platform.
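
The same check can be done mechanically. This is only an illustrative sketch: it parses the smpboot line quoted above and compares the model number with the INTEL_FAM6_HASWELL_CORE value from the #define. The regular expression and the helper name `parse_cpu_ids` are assumptions made for this example, not kernel code.

```python
import re

# Model number quoted from the #define above.
INTEL_FAM6_HASWELL_CORE = 0x3C

# dmesg line quoted from the attached log.
LINE = (
    "[    0.067784] smpboot: CPU0: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz "
    "(family: 0x6, model: 0x3c, stepping: 0x3)"
)

PATTERN = re.compile(
    r"family: (0x[0-9a-fA-F]+), model: (0x[0-9a-fA-F]+), stepping: (0x[0-9a-fA-F]+)"
)


def parse_cpu_ids(line):
    """Return (family, model, stepping) as integers, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return tuple(int(group, 16) for group in match.groups())


family, model, stepping = parse_cpu_ids(LINE)
is_haswell = family == 0x6 and model == INTEL_FAM6_HASWELL_CORE
print(is_haswell)
```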
Comment 23 Len Brown 2017-06-19 23:14:46 UTC
please reboot into BIOS SETUP and make sure that the system is using SETUP *DEFAULTS* and is not overclocked.

you might also consider booting into memtest and running that overnight

Another thing to try may be in the BIOS SETUP, disabling high frequency performance states.

in general, this looks like an electrical hardware problem, rather than a Linux kernel issue.
Comment 24 brendan_os 2017-06-21 04:38:59 UTC
I'm not overclocking. Everything is auto.  
The system was fine up to and including kernel 3.11.10-34.1 
see https://bugzilla.opensuse.org/show_bug.cgi?id=990003

After that kernel I would need to use nohz=off to boot (OpenSuSE 13.1). I tried staying on that kernel for as long as possible, then tried to upgrade to leap, hoping a newer kernel (v 4+) would fix it. When I upgraded to OpenSuSE Leap 42.2 it wouldn't boot without acpi=off / intel_idle.max_cstate=1

I've assumed that since it works with the older kernel it's not a hardware problem? Is that fair?

I will need to research what to disable re high frequency performance states. I think the BIOS has intel turbo boost set.
Comment 25 brendan_os 2018-02-23 00:33:50 UTC
Looks like the hardware problem assessment was correct. I have replaced my power supply and can now boot with the kernel's default boot options. I have run the machine for several hours and no random reboots.
Comment 26 Zhang Rui 2018-09-27 07:06:27 UTC
good to know. Bug closed
