Bug 60770
Summary: | Sony Laptop Boot hang, unless boot-on-AC or maxcpus=1 or acpi_osi= (REGRESSION BISECTED to mwait_idle()) -- Sony VAIO vgn-c2m Intel T5500 | ||
---|---|---|---|
Product: | Power Management | Reporter: | Ian Malone (ibmalone) |
Component: | cpuidle | Assignee: | Len Brown (lenb) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | aaron.lu, ibmalone, jwboyer, tianyu.lan |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 3.10.6-200.fc19 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
dmesg output from successful boot
acpidump pictures of failed boot process picture of failed boot process, debugging enabled git bisect log diff on dmesg battery vs psu dmidecode output test patch vs 3.13-rc6 to restore mwait_idle() test patch vs 3.13-rc6 to disable PDC IO_HALT bit |
Description
Ian Malone
2013-08-19 19:47:28 UTC
Created attachment 107247 [details]
acpidump
Created attachment 107248 [details]
pictures of failed boot process
Output during failed boot process. One picture of the final point at which boot halts, two captures of video taken from boot (video at 120fps, limiting factor seems to be screen refresh rate)
Please try to bootup with kernel param "acpi.debug_layer=0xffffffff acpi.debug_level=0xffffffff" and provide the photo shot when system halt. BTW, bootup on battery works before or never work? Created attachment 107250 [details]
picture of failed boot process, debugging enabled
Hi, I've tried that (a couple of times in case of typos) and the result looks the same to me (attached anyway). 3.8.13-100.fc17 allowed boot on battery, but I had to upgrade before I could report in Fedora bugzilla and no longer have that kernel.
On more thing I've noticed: the power state (battery/mains) at the time of loading the kernel is not what matters. It depends on the power state at power on/loading grub. If I turn external power on or off at the grub menu before proceeding it makes no difference, boot fail occurs if running on battery to the point grub has started and boot is successful if running on mains to that point. However the kernel does have an effect, when I was using the 3.8.13 kernel above switching between that and the later ones affected the boot success.
So this is a regression and the photo doesn't show any clues. Could you do a bisect between v3.8.13 and v.10.6 to find which commit causes this issue? You can follow this link to do bisect. http://wiki.gentoo.org/wiki/Kernel_git-bisect Created attachment 107288 [details]
git bisect log
Hi,
Sorry for the delay, 13 kernel bisects on a T5500 laptop takes a while...
69fb3676df3329a7142803bb3502fa59dc0db2e3
"x86 idle: remove mwait_idle() and "idle=mwait" cmdline param"
Double checked this versus the commit before
6a377ddc4e4ede2eeb9cd46ada23bbe417704fc9
which is good.
Also, starting on battery I tried idle=mwait, idle=halt and idle=poll options with the bad commit, idle=poll works (though is less than ideal on battery).
On the last good kernel passing idle=nomwait causes the same hang and idle=mwait boots okay. So the problem seems to be that the previous idle strategy was mwait which is no longer available and that the others cause this hang at boot when on battery for some reason. I have now also tried "intel_idle.max_cstate=0 processor.max_cstate=0" on the bad kernel since it's supposed to have some interaction with this and it still does not boot on battery. Possibly relevant, the T5500 is listed as not supporting idle states: http://ark.intel.com/products/27253 though some of the higher end members of the T5000 series do. Hi, any further testing I can do to move this on? Tried: intel_idle.max_cstate=1 processor.max_cstate=1 (in addition to =0 in comment 7) hpet=disable Bios settings have no power management options. I'm unable to clear the needinfo flag, but there's no outstanding request for information (except mine as to what else I can collect). * The problem only occurs on boot, AC-powered boot is okay, battery boot is not. * Going to battery after boot is okay. * Boot with idle=mwait before it was retired was fine, mwait hints was not, nor were the nomwait options. * idle=poll is fine, but eats battery. Whatever interaction is going on with idle states during boot is a major culprit here, but I don't know enough to be able to start investigating it without help. This is also true of whatever the difference is between mwait and mwait with hints. The processor in question apparently does not support C states. Why idle=nomwait and attempts I've made to not use idle modes don't work I'm not sure. Created attachment 112121 [details]
diff on dmesg battery vs psu
I've been looking at dmesg output on this machine, bearing in mind it's something to do with Mwait and c-states. The following difference in output occurs long after the hang (at 0.6s, compared to 0.1s), but may be relevant in terms of what mwait and mwait hints think is possible:
good and bad kernels on ac power:
[ 0.669814] ACPI: Power Button [PWRB]
[ 0.670733] ACPI: Requesting acpi_cpufreq
[ 0.672438] Monitor-Mwait will be used to enter C-1 state
[ 0.672447] Monitor-Mwait will be used to enter C-2 state
[ 0.672451] tsc: Marking TSC unstable due to TSC halts in idle
good kernel on battery power:
[ 1.240106] ACPI: Power Button [PWRB]
[ 1.241927] ACPI: Requesting acpi_cpufreq
[ 1.245699] Monitor-Mwait will be used to enter C-1 state
[ 1.245713] Monitor-Mwait will be used to enter C-2 state
[ 1.245721] Monitor-Mwait will be used to enter C-3 state
[ 1.245726] tsc: Marking TSC unstable due to TSC halts in idle
Attached is full diff on the last good commit on 3.8.0-rc6+ dmesg output battery versus psu.
This changes dynamically: power online: $ for F in /sys/devices/system/cpu/cpu0/cpuidle/state?/desc ; do echo $F:$(<$F) ; done /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0 /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10 battery: $ for F in /sys/devices/system/cpu/cpu0/cpuidle/state?/desc ; do echo $F:$(<$F) ; done /sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE /sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0 /sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10 /sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x30 Created attachment 115051 [details]
dmidecode output
Spent a little while investigating whether I could disable C3 on the basis that seemed to be responsible. However:
1. Using processor_idle.max_cstate=2 during boot, C3 still shows up when AC power is taken off.
2. It seems this state is used successfully once the machine is running:
On AC:
$ cat /sys/class/power_supply/ADP1/online; for F in /sys/bus/cpu/drivers/processor/cpu0/cpuidle/state? ; do echo $(cat $F/name), $(cat $F/time) ; done
1
POLL, 272
C1, 7728
C2, 186921110
On battery (note C3 appears):
$ cat /sys/class/power_supply/ADP1/online; for F in /sys/bus/cpu/drivers/processor/cpu0/cpuidle/state? ; do echo $(cat $F/name), $(cat $F/time) ; done
0
POLL, 0
C1, 33494
C2, 4215
C3, 1175271
Attaching the dmidecode output in case it's any use.
I've also noticed looking at the dmesg output:
[ 0.076106] ACPI: Added _OSI(Processor Aggregator Device)
[ 0.077675] ACPI: EC: Look up EC in DSDT
[ 0.110328] ACPI: SSDT 000000007f691300 00240 (v01 Sony VAIO 20061219 PTL 20050624)
[ 0.110656] ACPI: Dynamic OEM Table Load:
[ 0.110659] ACPI: SSDT (null) 00240 (v01 Sony VAIO 20061219 PTL 20050624)
- the crash happens at about 0.1, mention of idle states doesn't occur (in a successful boot) until about 0.5s later:
[ 0.663366] intel_idle: does not run on family 6 model 15
[ 0.666497] ACPI: AC Adapter [ADP1] (on-line)
[ 0.666643] input: Lid Switch as /devices/LNXSYSTM:00/device:00/PNP0C0D:00/input/input0
[ 0.668136] ACPI: Lid Switch [LID0]
[ 0.668195] input: Power Button as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input1
[ 0.668201] ACPI: Power Button [PWRB]
[ 0.669080] ACPI: Requesting acpi_cpufreq
[ 0.670772] Monitor-Mwait will be used to enter C-1 state
[ 0.670781] Monitor-Mwait will be used to enter C-2 state
[ 0.670785] tsc: Marking TSC unstable due to TSC halts in idle
[ 0.670791] ACPI: acpi_idle registered with cpuidle
[ 0.672209] ACPI: Invalid active0 threshold
(intel_idle is not used as the driver doesn't support it) I'm left wondering if the bug is not really cstate related but something related to mwait with hints default behaviour during boot which is different from the old mwait. Note boot with idle=poll (or whatever the correct form of the poll option is) does work.
Just tried acpi=ht, fails. Note to self: https://01.org/linux-acpi/documentation/debug-how-isolate-linux-acpi-issues acpi=ht the most like "acpi=off", disables all of ACPI except what is needed to enumerate processors. If acpi=off works and acpi=ht fails, then the issue is in the ACPI table parsing code itself, or perhaps the SMP code. Thanks for debugging, Ian. Yes, it is common for ACPI to hide C3 on AC, and expose it on DC -- particularly on older systems where C3 was very expensive. You are positive that your system broke starting with this commit? 69fb3676df3329a7142803bb3502fa59dc0db2e3 "x86 idle: remove mwait_idle() and "idle=mwait" cmdline param" What that commit does is remove the MWAIT C1 idle loop, on the assumption that every system can run HALT instead -- at least until cpuidle probes and adds the deeper C-states. Perhaps that assumption is not always valid on your system... On the last working kernel, booted on DC, you can boot successfully with idle=halt, or no? If that works, then HALT itself is okay, but perhaps HALT in combination with C3 is not okay. You mention that "idle=nomwait" fails on the last good kernel. Is this true, whether booting on AC or booting on DC? Since you can boot on AC, you are doing fine with C1 and C2. But when you boot on DC and try to use C1/C1/C3, you fail. Interestingly, this is a boot-time issue -- as you can boot on AC and pull the plug and then get to exactly the same state as if you had booted on DC -- C1/C2/C3. I assume you can switch between AC and DC a bunch of times without troubles, not just once? try booting with "processor.max_cstate=2" and that should get rid of C3, even when you boot or run on DC. Try this also in combination with idle=nomwait. If that doesn't work, try it again with the value '1' instead of '2' My guess at this point is that there is BIOS SMM involved in the idle path on this machine, and Linux is breaking some assumption that BIOS code is making. If that turns out to be the case, we would probably have to implement some sort of DMI workaround for this system. (please check that you've got the latest BIOS) Thanks for getting back to me. Yes that's the affected commit, I've got before and after kernels that do/don't boot. The before (good, 3.8.0-rc6+) kernel wont boot on DC with idle=halt or with idle=nomwait. In both idle=halt and idle=nomwait, good kernel, I've just tried both processor.max_cstate=1 and processor.max_cstate=2, neither will boot on 'DC'*. However I'm not sure these are properly disabling the higher cstates at the relevant stage, much later during boot there's a mention that the C-state is limited (about the time the idle driver loads). *Something incorrect in my earlier report: what matters for the DC/AC behaviour is whether the computer is plugged in during BIOS start, unplugging/plugging while in grub does not affect the behaviour. To be clear, idle=halt: BIOS start on DC, grub keep DC : fail BIOS start on DC, grub plug AC : fail BIOS start on AC, grub keep AC : boot BIOS start on AC, grub unplug DC : boot I've just tried maxcpus=1 and this works on a DC boot, both with a fairly recent kernel (3.11.9) and with the good kernel in idle=halt mode. Though maybe that just means the processor never goes idle. Re BIOS: there has never been an update released for the BIOS on this system. It's a heavily locked-down Phoenix BIOS (reports version R0080J4). The 'advanced' menu is limited to disabling the Vaio animation at boot and network boot... Still pushing random buttons... acpi_osi= appears to work without restricting c-states, processors, backlight etc. Diffing dmesg output versus powered boot without this option shows no difference in output prior to the point the hang occurs other than small tsc and bogomips number changes and this line indicating OSI disabled: - ACPI: _OSI method disabled Later in boot a few devices appear in different order. I'd previously tried acpi_osi="Linux" and a few variations on Microsoft and Windows. (e.g. "Windows 2006" which should be Vista, the installed OS on the machine), but these made no difference. I've tried acpi_osi=!, acpi_osi=!* and acpi_osi="!STRING" with all four of the reported strings at the same time (processor device, processor aggregator, 3.0 _scp extensions, module device, with appropriate capitals). None of these appear to help. The acpi_osi=!, acpi_osi=!* I tried on a 3.12 kernel in case the option wasn't available in the Fedora packaged kernels. Only "acpi_osi=" helps, the one that disables _OSI (though Added _OSI still gets reported) . I'm not sure how to check what strings get reported to try disabling any more specific ones. Created attachment 120251 [details]
test patch vs 3.13-rc6 to restore mwait_idle()
Please test this patch, which restores mwait_idle() for C1
during the boot process.
Re: "acpi_osi=" boots This is a clue to the root cause. As "acpi_osi=!" didn't help, this has nothing to do with the value of OSI, but if OSI is valid available at all. So I think your machine hangs at this SMM TRAP: Device (PCI0) { Method (_INI, 0, NotSerialized) { If (CondRefOf (_OSI, Local0)) { If (MPEN) { TRAP (0x3D) } } Else { } Store (Zero, ^LPCB.SNC.XECR) Store (Zero, ^LPCB.SNC.SNBF) } Re: "maxcpus=1" boots Likely this is causing the BIOS to clear MPEN, which also avoids the same killer TRAP to SMM. Re: DC boot fails, AC boot works An additional clue. But I didn't notice any hooks related to that in the DSDT -- presumably there is some dependency inside the TRAP to SMM. You could possibly verify that this is the area of trouble by enabling some console debugging (see Documentation/acpi/* in your source tree) or modifying your DSDT to tickle the suspect area. Created attachment 120261 [details]
test patch vs 3.13-rc6 to disable PDC IO_HALT bit
Please also try this test patch by itself.
If it works, we have a nasty BIOS incompatibility problem --
so I hope that this one makes no difference.
Thanks, the mwait patch https://bugzilla.kernel.org/attachment.cgi?id=120251&action=diff works, The PDC patch (as hoped) does not boot under DC power https://bugzilla.kernel.org/attachment.cgi?id=120261&action=diff Len, is this queued for 3.13 or 3.14 somewhere? I don't see it in linux-next. Ian, Thanks for confirming the patch works. Josh, I've put it on my -next branch, and have submitted it to lkml for upstream: http://marc.info/?l=linux-kernel&m=138976439211647&w=2 Len, I still don't see this in linux-next and I don't see it on your -next branch on kernel.org: https://git.kernel.org/cgit/linux/kernel/git/lenb/linux.git/log/arch/x86/kernel/process.c?h=next The upstream thread had a few comments, but I don't see anything else as a result that turned up in the tree either. Any idea what's happened here? (In reply to Josh Boyer from comment #25) > Len, I still don't see this in linux-next and I don't see it on your -next > branch on kernel.org: > > https://git.kernel.org/cgit/linux/kernel/git/lenb/linux.git/log/arch/x86/ > kernel/process.c?h=next > > The upstream thread had a few comments, but I don't see anything else as a > result that turned up in the tree either. > > Any idea what's happened here? Len? looks like we didn't fix this upstream -- re-opening. commit b253149b843f89cd300cbdbea27ce1f847506f99 Author: Len Brown <len.brown@intel.com> Date: Wed Jan 15 00:37:34 2014 -0500 sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power savings and to improve performance entered v4.1. |