Bug 21032 - System hangs on Boot on Intel ATOM (samsung n510 @nynet)
Summary: System hangs on Boot on Intel ATOM (samsung n510 @nynet)
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_idle
Hardware: All Linux
Importance: P1 normal
Assignee: Len Brown
Reported: 2010-10-24 02:18 UTC by Christian Bahls
Modified: 2010-11-02 02:57 UTC

Kernel Version: 2.6.36
Regression: Yes


Attachments
dmesg for 2.6.35.7 no maximum cstate (61.36 KB, text/plain)
2010-10-24 02:21 UTC, Christian Bahls
dmesg for 2.6.36 no maximum cstate (62.90 KB, text/plain)
2010-10-24 02:22 UTC, Christian Bahls
dmesg for 2.6.36 maximum cstate=0 (62.91 KB, text/plain)
2010-10-24 02:23 UTC, Christian Bahls
dmesg for 2.6.36 maximum cstate=1 (62.85 KB, text/plain)
2010-10-24 02:23 UTC, Christian Bahls
dmesg for 2.6.36 maximum cstate=2 (62.85 KB, text/plain)
2010-10-24 02:24 UTC, Christian Bahls
dmesg for 2.6.36 maximum cstate=3 (66.09 KB, text/plain)
2010-10-24 02:24 UTC, Christian Bahls
dmesg for 2.6.36 maximum cstate=4 (63.09 KB, text/plain)
2010-10-24 02:25 UTC, Christian Bahls
dmesg for 2.6.36 nolapic_timer (63.13 KB, text/plain)
2010-10-24 02:25 UTC, Christian Bahls
.config for 2.6.36 (59.44 KB, text/plain)
2010-10-24 02:26 UTC, Christian Bahls
lspci -vk for 2.6.36 (7.44 KB, text/plain)
2010-10-24 02:28 UTC, Christian Bahls
cat /proc/cpuinfo (652 bytes, text/plain)
2010-10-24 02:28 UTC, Christian Bahls
acpidump for 2.6.36 (172.13 KB, text/plain)
2010-10-24 02:29 UTC, Christian Bahls
dmidecode dump (8.75 KB, text/plain)
2010-10-24 02:30 UTC, Christian Bahls
grep sysfiles for info about idle for max_cstate=0 (1.14 KB, text/plain)
2010-10-24 02:31 UTC, Christian Bahls
patch vs 2.6.36 to avoid using the LAPIC timer in ATM-C2 (1.01 KB, patch)
2010-10-24 03:34 UTC, Len Brown

Description Christian Bahls 2010-10-24 02:18:18 UTC
I have a Samsung N510 @nynet netbook.

Starting in the 2.6.35 development cycle,
the system hangs occasionally, especially on boot.

This bug is present in 2.6.36 and 2.6.35.7.

pressing the powerbutton or a key on the keyboard makes the system continue

This bug is quite reliably reproducible.

I did a bisection and ended up on the following commit:
 2671717265ae6e720a9ba5f13fbec3a718983b65

booting the system with "intel_idle.max_cstate=0"
makes it not hang

The effects are strongest when intel_idle.max_cstate=2

Mr Brown gave me some Homework to do, here it comes:
failing .config .. attached
output from lspci .. hopefully can be attached
output from cat /proc/cpuinfo .. hopefully can be attached
output from acpidump .. hopefully can be attached
output from dmidecode .. hopefully can be attached

for a 2.6.36 boot with intel_idle.max_cstate=0 (and acpi_idle boot)
output from grep . /sys/devices/system/cpu/cpuidle/*
output from grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*
dmesg .. hopefully can be attached

try intel_idle.max_cstate=1, include dmesg
and increase the '1' until it fails
My guess is that 1 will work, but some higher number will start failing.

=> max_cstate=2 fails the strongest ... max_cstate=4 has not such a strong effect

without any other bootparams, try "nolapic_timer"

=> yes, works .. does not hang .. see (hopefully) attached dmesg ending in ".nolapic"
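
When comparing the attached dmesg logs it helps to confirm which of the boot parameters tried above were actually in effect for a given boot. The following is a minimal sketch, assuming the command line is available as a string (for example, the contents of /proc/cmdline); the function names are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}.

    Bare flags such as "nolapic_timer" map to True.
    Assumption: no quoted values containing spaces.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


def idle_workarounds(cmdline):
    """Report the two parameters exercised in this bug report."""
    params = parse_cmdline(cmdline)
    return {
        # None means the parameter was not given at all.
        "max_cstate": params.get("intel_idle.max_cstate"),
        "nolapic_timer": params.get("nolapic_timer", False) is True,
    }


# Illustrative command line; only the intel_idle parameter comes from this report.
print(idle_workarounds("ro quiet intel_idle.max_cstate=1"))
```

On a running system the same function could be fed the text read from /proc/cmdline.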
Comment 1 Christian Bahls 2010-10-24 02:21:14 UTC
Created attachment 34632
dmesg for 2.6.35.7 no maximum cstate
Comment 2 Christian Bahls 2010-10-24 02:22:12 UTC
Created attachment 34642
dmesg for 2.6.36 no maximum cstate
Comment 3 Christian Bahls 2010-10-24 02:23:02 UTC
Created attachment 34652
dmesg for 2.6.36 maximum cstate=0
Comment 4 Christian Bahls 2010-10-24 02:23:30 UTC
Created attachment 34662
dmesg for 2.6.36 maximum cstate=1
Comment 5 Christian Bahls 2010-10-24 02:24:05 UTC
Created attachment 34672
dmesg for 2.6.36 maximum cstate=2
Comment 6 Christian Bahls 2010-10-24 02:24:30 UTC
Created attachment 34682
dmesg for 2.6.36 maximum cstate=3
Comment 7 Christian Bahls 2010-10-24 02:25:02 UTC
Created attachment 34692
dmesg for 2.6.36 maximum cstate=4
Comment 8 Christian Bahls 2010-10-24 02:25:41 UTC
Created attachment 34702
dmesg for 2.6.36 nolapic_timer
Comment 9 Christian Bahls 2010-10-24 02:26:14 UTC
Created attachment 34712
.config for 2.6.36
Comment 10 Christian Bahls 2010-10-24 02:28:18 UTC
Created attachment 34722
lspci -vk for 2.6.36
Comment 11 Christian Bahls 2010-10-24 02:28:46 UTC
Created attachment 34732
cat /proc/cpuinfo
Comment 12 Christian Bahls 2010-10-24 02:29:40 UTC
Created attachment 34742
acpidump for 2.6.36
Comment 13 Christian Bahls 2010-10-24 02:30:20 UTC
Created attachment 34752
dmidecode dump
Comment 14 Christian Bahls 2010-10-24 02:31:48 UTC
Created attachment 34762
grep sysfiles for info about idle for max_cstate=0

       output from grep . /sys/devices/system/cpu/cpuidle/*
       output from grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*
Comment 15 Len Brown 2010-10-24 03:32:13 UTC
> pressing the powerbutton
> or a key on the keyboard makes the system continue

does a single press get the system "un-stuck" and it
runs normally from then on, or do you have to continue
to give it button events to keep it from stalling again?

> max_cstate=2 fails the strongest
> ... max_cstate=4 has not such a strong effect

Hmm, this may be because in max_cstate=2, we use
C2 a lot, and the LAPIC timer is failing in C2.

But if not limited to C2, we rarely use C2
and lapic_timer_reliable_states instructs us
to not use that timer in C4 where we know it stops:

intel_idle: lapic_timer_reliable_states 0x6

> without any other bootparams, try "nolapic_timer"
> yes, works .. does not hang ..

Okay, that is a "smoking gun".
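
The 0x6 mask quoted above can be decoded with a short sketch. It assumes that bit N of lapic_timer_reliable_states being set means the LAPIC timer is trusted to keep running in C-state N, and that a C-state whose bit is clear must rely on a broadcast timer instead. The helper names are hypothetical and are not the driver's own functions.

```python
from typing import List


def reliable_cstates(mask: int, max_cstate: int = 8) -> List[int]:
    """Return the C-state numbers whose bit is set in the mask.

    Assumption: bit N corresponds to C-state N.
    """
    return [n for n in range(max_cstate + 1) if mask & (1 << n)]


def needs_broadcast_timer(mask: int, cstate: int) -> bool:
    """True when the LAPIC timer is NOT marked reliable for this C-state."""
    return not (mask & (1 << cstate))


# Mask reported in this bug: C1 and C2 are treated as reliable.
print(reliable_cstates(0x6))          # [1, 2]
print(needs_broadcast_timer(0x6, 4))  # True: C4 already avoids the LAPIC timer
print(needs_broadcast_timer(0x6, 2))  # False: C2 still trusts the LAPIC timer
```

Under this reading, 0x6 explains why limiting the system to C2 hurts most: C2 is the deepest state that still depends on the LAPIC timer.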
Comment 16 Len Brown 2010-10-24 03:34:48 UTC
Created attachment 34772
patch vs 2.6.36 to avoid using the LAPIC timer in ATM-C2

Please test this patch using no cmdline parameters.
Please show the output from

grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*
Comment 17 Christian Bahls 2010-10-24 03:37:15 UTC
(In reply to comment #15)
> > pressing the powerbutton
> > or a key on the keyboard makes the system continue
> 
> does a single press get the system "un-stuck" and it
> runs normally from then on, or do you have to continue
> to give it button events to keep it from stalling again?

have to press it every time it stalls ..
 .. so even for shutdown i have to press a key ..

> > max_cstate=2 fails the strongest
> > ... max_cstate=4 has not such a strong effect
> 
> Hmm, this may be because in max_cstate=2, we use
> C2 a lot, and the LAPIC timer is failing in C2.
> 
> But if not limited to C2, we rarely use C2
> and lapic_timer_reliable_states instructs us
> to not use that timer in C4 where we know it stops:
> 
> intel_idle: lapic_timer_reliable_states 0x6
> 
> > without any other bootparams, try "nolapic_timer"
> > yes, works .. does not hang ..
> 
> Okay, that is a "smoking gun".

hopefully .. :)
Comment 18 Christian Bahls 2010-10-24 04:06:15 UTC
(In reply to comment #16)
> Created an attachment (id=34772)
> patch vs 2.6.36 to avoid using the LAPIC timer in ATM-C2
> 
> Please test this patch using no cmdline parameters.

seems to work

> Please show the output from
> grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*

/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu0/cpuidle/state0/time:32
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage:1
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:1
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu0/cpuidle/state1/time:1213
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage:15
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu0/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu0/cpuidle/state2/time:179795525
/sys/devices/system/cpu/cpu0/cpuidle/state2/usage:2205
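
The `grep .` listing above is easier to compare across boots once it is grouped per state. The following is a minimal sketch that parses such output and computes each state's share of the recorded idle time; it assumes the `time` attribute is cumulative residency in microseconds, and the function names are hypothetical. The sample lines are copied from the output above.

```python
import re
from collections import defaultdict

LINE_RE = re.compile(
    r"^/sys/devices/system/cpu/(cpu\d+)/cpuidle/(state\d+)/(\w+):(.*)$"
)

SAMPLE = """\
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu0/cpuidle/state0/time:32
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state1/time:1213
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu0/cpuidle/state2/time:179795525
"""


def parse_cpuidle(text):
    """Group 'path:value' lines into {(cpu, state): {attribute: value}}."""
    states = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            cpu, state, attr, value = match.groups()
            states[(cpu, state)][attr] = value
    return dict(states)


def residency_share(states, cpu="cpu0"):
    """Fraction of the recorded idle time spent in each named state."""
    times = {
        attrs["name"]: int(attrs["time"])
        for (c, _), attrs in states.items()
        if c == cpu and "name" in attrs and "time" in attrs
    }
    total = sum(times.values())
    return {name: t / total for name, t in times.items()} if total else {}


for name, share in residency_share(parse_cpuidle(SAMPLE)).items():
    print(f"{name}: {share:.6f}")
```

For the sample, nearly all recorded idle time falls in C2, which suggests C2 is still entered with the patch applied.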

Is the patch disabling the LAPIC timer for all Atom platforms,
or just for the nVidia MCP7?
Comment 19 Len Brown 2010-10-25 18:57:42 UTC
The patch is disabling lapic_timer for C2 on all Atom.

(note, this is what acpi_idle has been doing all along)

This isn't how the chip is designed to be hooked up,
but there is an additional sighting
over at bug 20172 where "nolapic_timer" prevents
a hang on an Atom with an Intel chip-set -- so maybe
the issue is more widespread than just this nvidia chipset.
Comment 20 Len Brown 2010-11-02 02:57:18 UTC
shipped in linux-2.6.37-rc1

closed

commit c25d29952b2a8c9aaf00e081c9162a0e383030cd
Author: Len Brown <len.brown@intel.com>
Date:   Sat Oct 23 23:25:53 2010 -0400

    intel_idle: do not use the LAPIC timer for ATOM C2
