Bug 65871
Summary: | Boot panic unless intel_idle.max_cstate=3 disables ATM-C6 - Atom Z530 (Poulsbo) | ||
---|---|---|---|
Product: | Power Management | Reporter: | timo (kernel_tk) |
Component: | intel_idle | Assignee: | Len Brown (lenb) |
Status: | CLOSED DOCUMENTED | ||
Severity: | normal | CC: | aaron.lu, lenb, rjw, tianyu.lan |
Priority: | P1 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 3.10.20, 3.12.1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
kernel 3.10.20 log (panic)
kernel 3.10.20 log (acpi=off, works)
lspci -vvvnn
acpidump |
Description
timo
2013-11-26 13:10:30 UTC
Created attachment 116181 [details]
kernel 3.10.20 log (acpi=off, works)
Created attachment 116191 [details]
lspci -vvvnn
Created attachment 116201 [details]
acpidump
Please test an upstream kernel: https://www.kernel.org
BTW, is there a working kernel, or do you always need to specify acpi=off?
Exact same issue with kernel 3.12.1, log files almost identical to 3.10.20. Works with "acpi=off".
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.1-1.el6.elrepo.i686 #1
> c10d7b10 c17c1d34 c155ac54 f58b0800 c17c1d58 c155aa23 c170bc44 c1913840
> 00000000 8baebcb9 f58b0800 c10d7b10 00000000 c17c1d70 c10d7bd6 c170fda8
> 00000000 01085f12 39241a2c c17c1dc4 c110b57f 7fffffff f71bc3f0 c17c1d90
> Call Trace:
> [<c10d7b10>] ? watchdog_nmi_enable+0x150/0x150
> [<c155ac54>] dump_stack+0x41/0x55
> [<c155aa23>] panic+0x87/0x199
> [<c10d7b10>] ? watchdog_nmi_enable+0x150/0x150
> [<c10d7bd6>] watchdog_overflow_callback+0xc6/0xd0
> [<c110b57f>] __perf_event_overflow+0xaf/0x280
> [<c1022bf4>] ? x86_perf_event_set_period+0x134/0x1f0
> [<c110c015>] perf_event_overflow+0x15/0x20
> [<c1029e06>] intel_pmu_handle_irq+0x1d6/0x3b0
> [<c1085f12>] ? sched_clock_local+0xb2/0x190
> [<c1078707>] ? hrtimer_start+0x27/0x30
> [<c10b5902>] ? tick_nohz_stop_sched_tick+0x2e2/0x340
> [<c1560551>] perf_event_nmi_handler+0x31/0x50
> [<c155fc62>] nmi_handle+0x52/0x1a0
> [<c10accd8>] ? ktime_get+0x48/0x100
> [<c10b3fb5>] ? tick_broadcast_oneshot_control+0x85/0x1b0
> [<c155fe93>] default_do_nmi+0x43/0x280
> [<c1058de0>] ? ns_to_timespec+0x40/0x60
> [<c1560184>] do_nmi+0xb4/0xf0
> [<c155f34b>] nmi_stack_correct+0x2f/0x34
> [<c10a3e6c>] ? cpu_idle_loop+0xec/0x1d0
> [<c10a3fb6>] cpu_startup_entry+0x66/0x70
> [<c15545d7>] rest_init+0x67/0x70
> [<c184dc7a>] start_kernel+0x3cd/0x3d3
> [<c184d71e>] ? repair_env_string+0x5b/0x5b
> [<c184d396>] i386_start_kernel+0x139/0x13c
With kernel 2.6.32 the panic does not occur, "acpi=off" is not needed.
This is a regression. Could you git bisect to find which commit caused this issue?
I could test some binary kernels if they were available. The elrepo seems to provide only the latest kernel. Do you know if there is an archive?
Please try booting with "idle=poll".
Also, does it make any difference if you boot with maxcpus=0?
kernel 3.12.1
"acpi=off" works (no other parameter required)
"idle=poll" works (no other parameter required)
"maxcpus=0" works (no other parameter required)
"maxcpus=1" doesn't work: Watchdog detected hard LOCKUP on cpu 0
Please try booting with "processor.max_cstate=1". If that works, please try again, increasing the '1' until it fails.
"processor.max_cstate=1" doesn't make any difference (3.12.1 and 3.10.20).
timo: Did it actually work with any kernels earlier than 3.10 and if so, which kernel is the last known good one? For now we know that 2.6.32 was working, but that's a bit too far into the past.
Re: comment #1 -- dmesg from working case
> NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
What happens if you boot with nmi_watchdog=0?
Also, I assume that when this fails, it fails during boot, and it fails 100% of the time?
scsi0 : pata_sch
scsi1 : pata_sch
ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xffa0 irq 14
ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xffa8 irq 15
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
Is the panic always immediately after probing the PATA controllers?
Are the 3.12.1 results from an upstream kernel built from source, or a binary supplied by a distro? (I didn't see dmesg from them)
Was the 2.6.32 success case an upstream kernel, or a distro binary?
FWIW, I located a poulsbo netbook, installed Fedora Core 20 and did not see this issue.
FC20's kernel is 3.11.10:
3.11.10-301.fc20.i686+PAE
Then I snagged the vanilla upstream kernels for fedora from https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories and they worked too:
3.14.3-200.vanilla.stable.knurd.1.fc20.i686
3.15.0-0.rc3.git0.1.vanilla.mainline.knurd.1.fc20.i686
If this issue cannot be reproduced using an un-modified upstream kernel, I'm inclined to close this as not an upstream Linux kernel issue. Further, if it can't be reproduced on another Poulsbo box, then it could also be a sample defect specific to the unit that timo has on hand.
The last known good kernel is the latest kernel used in CentOS 6: 2.6.32-431.17.1.el6.i686
Previous CentOS 6 kernels also worked.
Tested with 3.14.4-1.el6.elrepo.i686: Same issue as described, i.e. kernel panic unless idle=poll.
nmi_watchdog=0 doesn't work either, leads to kernel panic.
The issue is 100% reproducible.
3.14.4-1.el6.elrepo.i686 doesn't look like a mainline kernel however. Are you able to test a mainline kernel?
It's the mainline kernel from the elrepo-kernel repository. Here's the description:
> http://elrepo.org/tiki/kernel-ml
As the installation is CentOS 6 the easiest way to test the latest kernel without recompiling seems to be the elrepo.
Thanks for confirming that 3.14.4-1.el6.elrepo.i686 still fails. That is based on the latest stable kernel, which is a great reference.
Does the 2.6.32-431.17.1.el6.i686 kernel still work?
> "maxcpus=0" works (no other parameter required)
> "maxcpus=1" doesn't work: Watchdog detected hard LOCKUP on cpu 0
These are both uni-processor options, but maxcpus=0
also disables the ioapic.
What if you boot with just "noapic"?
Re: comment #11
processor.max_cstate=1 makes no difference (but idle=poll does)
cat /sys/devices/system/cpu/cpuidle/current_driver
If it says "intel_idle", then what you want to try instead here is "intel_idle.max_cstate=1" and increase the 1 until it fails.
You can see what C-states are available with
grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
2.6.32-431.17.1.el6.i686 still works.
3.14.4: With "noapic" it hangs at this point (no more messages, no panic):
> [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> [drm] No driver support for vblank timestamp query.
> gma500 0000:00:02.0: trying to get vblank count for disabled pipe 1
> gma500 0000:00:02.0: trying to get vblank count for disabled pipe 1
> gma500 0000:00:02.0: Backlight lvds set brightness 7a120000
> [drm] Initialized gma500 1.0.0 2011-06-06 for 0000:00:02.0 on minor 0
> dracut: Starting plymouth daemon
> scsi0 : pata_sch
> scsi1 : pata_sch
> ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xffa0 irq 14
> ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xffa8 irq 15
2.6.32-431:
> [root@tp57 ~]# cat /sys/devices/system/cpu/cpuidle/current_driver
> intel_idle
> [root@tp57 ~]# grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
> /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:C0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
> /sys/devices/system/cpu/cpu0/cpuidle/state0/time:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/usage:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:MWAIT 0x00
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:ATM-C1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/power:1000
> /sys/devices/system/cpu/cpu0/cpuidle/state1/time:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/usage:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:MWAIT 0x10
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:20
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:ATM-C2
> /sys/devices/system/cpu/cpu0/cpuidle/state2/power:500
> /sys/devices/system/cpu/cpu0/cpuidle/state2/time:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/usage:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:MWAIT 0x30
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:100
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:ATM-C4
> /sys/devices/system/cpu/cpu0/cpuidle/state3/power:250
> /sys/devices/system/cpu/cpu0/cpuidle/state3/time:6932
> /sys/devices/system/cpu/cpu0/cpuidle/state3/usage:60526
3.14.4:
intel_idle.max_cstate=1 works
intel_idle.max_cstate=2 works
intel_idle.max_cstate=3 works
intel_idle.max_cstate=4 Kernel panic: Watchdog detected hard LOCKUP on cpu 0
intel_idle.max_cstate=3 disables ATM-C6 on this system.
Re: comment #25 sysfs output
I don't know why intel_idle in 2.6.32 doesn't export ATM-C6. But intel_idle didn't exist in upstream 2.6.32, so you're running a distro back-port in that kernel.
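The `grep .` dumps in this thread are easier to compare once they are grouped per state. The following is a minimal sketch, not part of the original report: the helper names `parse_cpuidle_dump` and `deepest_state` are hypothetical, and the sample text is a shortened excerpt of the 3.14.4 output quoted in this report.

```python
import re
from collections import defaultdict

# Matches dump lines such as
#   /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C4-ATM
# A leading "> " quote marker is tolerated because search() is used.
LINE_RE = re.compile(r"/cpuidle/state(\d+)/(\w+):(.*)$")


def parse_cpuidle_dump(text):
    """Group `grep . .../cpuidle/*/*` output as {state_index: {attr: value}}."""
    states = defaultdict(dict)
    for raw in text.splitlines():
        match = LINE_RE.search(raw.strip())
        if match:
            index, attr, value = match.groups()
            states[int(index)][attr] = value
    return dict(states)


def deepest_state(states):
    """Return (index, name) of the highest-numbered state in the dump."""
    index = max(states)
    return index, states[index].get("name")


# Shortened excerpt of the 3.14.4 "maxcpus=0" dump from this report.
sample = """\
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:MWAIT 0x30
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C4-ATM
> /sys/devices/system/cpu/cpu0/cpuidle/state4/desc:MWAIT 0x52
> /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:140
> /sys/devices/system/cpu/cpu0/cpuidle/state4/name:C6-ATM
"""

print(deepest_state(parse_cpuidle_dump(sample)))
```

Run against a full dump, this shows at a glance whether the deepest exported state is C6-ATM (intel_idle default) or stops at the MWAIT 0x30 state.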
I don't know why maxcpus=0 works and maxcpus=1 doesn't. Can you verify that "maxcpus=1 noapic" works? In theory, that is the same as "maxcpus=0".
On that successful boot with maxcpus=0, please show
grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
to verify that we are indeed using ATM-C6 in that configuration.
Please boot 3.14.stable with "intel_idle.max_cstate=0", which will disable the intel_idle driver completely, and the acpi_idle driver should load. Then show this:
dmesg | grep idle
grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
Presumably this configuration will work; presumably ACPI is not exporting ATM-C6 on this box because it has some issue with the ATM-C6 state.
Please attach the output from dmidecode.
Also, please try booting with "nolapic_timer".
3.14.4:
"maxcpus=1 noapic": works (no panic but hangs later. Probably a different issue, see messages below).
"maxcpus=0":
> [root@tp57 ~]# grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
> /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
> /sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
> /sys/devices/system/cpu/cpu0/cpuidle/state0/time:585
> /sys/devices/system/cpu/cpu0/cpuidle/state0/usage:8
> /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:MWAIT 0x00
> /sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:10
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1E-ATM
> /sys/devices/system/cpu/cpu0/cpuidle/state1/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/time:407651
> /sys/devices/system/cpu/cpu0/cpuidle/state1/usage:1122
> /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:MWAIT 0x10
> /sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:20
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2-ATM
> /sys/devices/system/cpu/cpu0/cpuidle/state2/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/time:2034764
> /sys/devices/system/cpu/cpu0/cpuidle/state2/usage:2083
> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:MWAIT 0x30
> /sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:100
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C4-ATM
> /sys/devices/system/cpu/cpu0/cpuidle/state3/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/time:480765
> /sys/devices/system/cpu/cpu0/cpuidle/state3/usage:183
> /sys/devices/system/cpu/cpu0/cpuidle/state4/desc:MWAIT 0x52
> /sys/devices/system/cpu/cpu0/cpuidle/state4/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:140
> /sys/devices/system/cpu/cpu0/cpuidle/state4/name:C6-ATM
> /sys/devices/system/cpu/cpu0/cpuidle/state4/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state4/time:42208090
> /sys/devices/system/cpu/cpu0/cpuidle/state4/usage:2374
"intel_idle.max_cstate=0":
> [root@tp57 ~]# dmesg | grep idle
> Kernel command line: ro root=UUID=50a45a92-026f-4a1d-ba48-adbb43a5be5d console=ttyS0,115200n8 intel_idle.max_cstate=0
> cpuidle: using governor ladder
> cpuidle: using governor menu
> intel_idle: disabled
> tsc: Marking TSC unstable due to TSC halts in idle
> ACPI: acpi_idle registered with cpuidle
> [root@tp57 ~]# grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
> /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
> /sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> /sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
> /sys/devices/system/cpu/cpu0/cpuidle/state0/time:15367
> /sys/devices/system/cpu/cpu0/cpuidle/state0/usage:11
> /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> /sys/devices/system/cpu/cpu0/cpuidle/state1/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/time:462644
> /sys/devices/system/cpu/cpu0/cpuidle/state1/usage:1474
> /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
> /sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:20
> /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> /sys/devices/system/cpu/cpu0/cpuidle/state2/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/time:4029039
> /sys/devices/system/cpu/cpu0/cpuidle/state2/usage:4863
> /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x30
> /sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:100
> /sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
> /sys/devices/system/cpu/cpu0/cpuidle/state3/power:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/time:84569186
> /sys/devices/system/cpu/cpu0/cpuidle/state3/usage:10306
> [root@tp57 ~]# dmidecode
> # dmidecode 2.12
> # No SMBIOS nor DMI entry point found, sorry.
"nolapic_timer": Same result as with "maxcpus=1 noapic", i.e. it passes the point where the panic would occur but hangs later. Last messages:
> sd 0:0:0:0: [sda] Attached SCSI disk
> random: nonblocking pool is initialized
> kjournald starting. Commit interval 5 seconds
> EXT3-fs (sda3): mounted filesystem with ordered data mode
> dracut: Mounted root filesystem /dev/sda3
> dracut: Loading SELinux policy
> audit: type=1404 audit(1400223818.234:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295
Unclear why C6-ATM fails on this system.
I'm going to assume that it is a board-specific issue
that only the board designer understood when it
was designed (likely about 2008). That designer
assumed they could skip C6-ATM support and the
OS would not use it b/c the BIOS didn't export it.
But that assumption didn't anticipate the Linux intel_idle driver
exposing the C-states via CPUID HW enumeration
instead of via ACPI tables.
You have 2 viable workarounds:
intel_idle.max_cstate=3
disables C6-ATM, so C4-ATM is deepest state used
intel_idle.max_cstate=0
disables intel_idle driver completely
uses acpi_idle driver, which exports only down to MWAIT 0x30 (C4-ATM)
I do not recommend using "acpi=off", "maxcpus=0", "idle=poll" etc, which
helped us isolate the specific issue, but have other negative effects.
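The effect of the two workarounds can be sketched as a lookup from intel_idle.max_cstate to the deepest state the driver would use. This is a rough model, assuming that sysfs stateN stays available only while N <= max_cstate. That assumption matches the results reported in this thread (3 boots with C4-ATM as the deepest state, 4 reaches C6-ATM and panics) but is not taken from the driver source. The name `deepest_allowed` is hypothetical.

```python
# State names in sysfs order, as listed on the 3.14.4 boot in this report.
STATES = ["POLL", "C1E-ATM", "C2-ATM", "C4-ATM", "C6-ATM"]


def deepest_allowed(max_cstate, states=STATES):
    """Deepest intel_idle state for a given intel_idle.max_cstate value.

    Assumption (inferred from this thread, not from the driver source):
    sysfs stateN is usable only when N <= max_cstate.
    Returns None for 0, where intel_idle is disabled and acpi_idle loads.
    """
    if max_cstate <= 0:
        return None
    # Values beyond the table simply leave every state enabled.
    return states[min(max_cstate, len(states) - 1)]


for value in (0, 1, 2, 3, 4):
    print(value, deepest_allowed(value))
```

Under this model, any value from 1 to 3 keeps the CPU out of C6-ATM, and 3 gives up the least power saving.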
I would generally recommend a quirk
to disable C6-ATM for this board, but the board
doesn't seem to identify itself:
> # No SMBIOS nor DMI entry point found, sorry.
Is it production-level hardware?
Is it running the latest BIOS?
lspci shows some interesting devices:
01:00.0 PCI bridge [0604]: PLX Technology, Inc. PEX 8505 5-lane, 5-port PCI Express Switch [10b5:8505] (rev aa) (prog-if 00 [Normal decode])
suggesting that this is not your typical poulsbo netbook...
If this board turns out to be "exotic" and not identifiable,
then my recommendation will be to close this report as DOCUMENTED
and you'll have to resort to deploying using one of the two
cmdline workarounds above.
Please re-open if there is a way for Linux to identify this board to install a quirk to disable C6 on it.
Otherwise, my recommendation is that the user deploy the cmdline workaround above.
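For a deployment that relies on the cmdline workaround, it may be worth checking at boot that the parameter is actually present. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the last occurrence of the parameter is assumed to be the one that takes effect, and the sample string is the command line quoted earlier in this report. On a live system the string would be read from /proc/cmdline.

```python
import re

# Matches one whole command-line token, e.g. "intel_idle.max_cstate=3".
PARAM_RE = re.compile(r"intel_idle\.max_cstate=(\d+)")


def intel_idle_max_cstate(cmdline):
    """Return the intel_idle.max_cstate value on a command line, or None."""
    value = None
    for token in cmdline.split():
        match = PARAM_RE.fullmatch(token)
        if match:
            # Assumption: if given more than once, the last one wins.
            value = int(match.group(1))
    return value


def workaround_active(cmdline, deepest_safe=3):
    """True if the command line limits intel_idle to C4-ATM or shallower."""
    value = intel_idle_max_cstate(cmdline)
    return value is not None and value <= deepest_safe


# Command line from the "intel_idle.max_cstate=0" boot in this report.
sample = ("ro root=UUID=50a45a92-026f-4a1d-ba48-adbb43a5be5d "
          "console=ttyS0,115200n8 intel_idle.max_cstate=0")

print(workaround_active(sample))
```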