Bug 11101

Summary: Regression: Freeze during boot on MSI M673 laptop
Product: ACPI Reporter: Dmitry Nezhevenko (dion)
Component: Config-InterruptsAssignee: Glauber Costa (gcosta)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, bunk, gcosta, krisvek, mingo, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 10492    
Attachments: 2.6.26 dmesg
2.6.26 dmesg with acpi=off (works)
lspci && lsmod
2.6.25.6 dmesg (that works great on this machine)
acpidump output
Enable IO-APIC after all cpus are booted
dmesg with applied patch "Enable IO-APIC after all cpus are booted"
full dmesg with applied patch "Enable IO-APIC after all cpus are booted"
dmesg with nolapic_timer
patch for c1e idle
2.6.26 with applied c1e patch
2.6.26 x86_64 dmesg
2.6.27-rc3, booted with nolapic_timer
dmesg output after having booted WITH acpi (and being very patient, and pressing the enter button a bunch at a later point) for kernel 2.6.26.2.
force c1e idle routine
Verbose debug messages for arch/x86/kernel/process.c
Screenshot of last messages with applied force-c1idle.patch and debug info patch
2.6.26 dmesg with apic=debug
2.6.26 dmesg with apic=debug nolapic_timer
17446: bisected dmesg with apic=debug
bisected dmesg with apic=debug nolapic_timer

Description Dmitry Nezhevenko 2008-07-16 09:33:12 UTC
Latest working kernel version: 2.6.25.6
Distribution: Debian unstable
Hardware Environment: MSI M673 laptop, i386

Hi. I've just compiled 2.6.26 vanilla kernel. During booting it just freeze after detecting my SATA devices. However after pressing some button (e.g. space, enter, anything) it continue boot for a while (1-2 seconds) and then freeze again. Finally I've booted it (with pressed enter button).

I've tried followed kernel parameters with no success:
1. noapic
2. pci=nommconfig
3. noacpi
4. acpi_processor.max_cstate=1

Passing acpi=off to kernel helps. It boots without freeze, however there are no PM features.

Currently my 'main' kernel is 2.6.25.6 which works great (including s2ram, s2disk using TuxOnIce).

I'm attaching dmesg, lspci, lsmod  and some other info.
Comment 1 Dmitry Nezhevenko 2008-07-16 09:33:48 UTC
Created attachment 16842 [details]
2.6.26 dmesg
Comment 2 Dmitry Nezhevenko 2008-07-16 09:34:16 UTC
Created attachment 16843 [details]
2.6.26 dmesg with acpi=off (works)
Comment 3 Dmitry Nezhevenko 2008-07-16 09:35:32 UTC
Created attachment 16844 [details]
lspci && lsmod
Comment 4 Dmitry Nezhevenko 2008-07-16 09:36:21 UTC
Created attachment 16845 [details]
2.6.25.6 dmesg (that works great on this machine)
Comment 5 Dmitry Nezhevenko 2008-07-16 09:37:40 UTC
PS. I've assigned this bug to ACPI, since there is no such problem with acpi=off. Sorry if this incorrect
Comment 6 ykzhao 2008-07-16 23:45:54 UTC
Hi, Dmitry
   Will you please try git-bisect to identify which commit causes this regression?
   Will you please attach the output of acpidump? 
   thanks.
Comment 7 Dmitry Nezhevenko 2008-07-17 01:10:20 UTC
Created attachment 16857 [details]
acpidump output

Attaching acpidump output
Comment 8 Dmitry Nezhevenko 2008-07-17 01:11:07 UTC
I'll try to use git-bisect later today or tomorrow.
Comment 9 Dmitry Nezhevenko 2008-07-17 14:12:16 UTC
Finally I've done git-bisect.

9713277607f9eac7d655c6854dd92bc2ce1b6f02 is first bad commit
commit 9713277607f9eac7d655c6854dd92bc2ce1b6f02
Author: Glauber de Oliveira Costa <gcosta@redhat.com>
Date:   Wed Mar 19 14:25:43 2008 -0300

    x86: boot cpus from cpu_up, instead of prepare_cpus

    After all the infrastructure work, we're now prepared
    to boot the cpus from cpu_up, and not from prepare_cpus.
    So the difference between cold boot and hotplug is effectively
    over, and the functions are used to the purposes they're meant to.

    Signed-off-by: Glauber Costa <gcosta@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

:040000 040000 c191cf15825af14a8f4e2842105c33e19ad9df1d 3a6e4c6a5489a20722e83c98204f16d29bdc01fd M      arch

Please let me know, if you need any other info about this laptop.
Comment 10 Zhang Rui 2008-07-17 18:22:30 UTC
add Glauber and Ingo to the cc list. :)
as this is a regression, cc Rafael.
Comment 11 Glauber Costa 2008-07-21 07:42:24 UTC
Sorry for the delay in answering this, I was travelling and just returned to life today.

I've seen this problem before, but got no success in fixing it, since I can't reproduce. This seems to happen only in x86_64 machines (this is your case, right?), and the furthest I came was to the conclusion that it was something related to udelay waits that were probably taking too long.

I'll give this a further thought.
Comment 12 Dmitry Nezhevenko 2008-07-21 07:54:33 UTC
It's Turion64 X2 based laptop with nvidia chipset. Machine supports x86_64 however I'm running 32-bit kernel and distro. dmesg and all other attached files are from 32 bit environment.
Comment 13 Dmitry Nezhevenko 2008-07-23 11:57:04 UTC
Hi again. I've spent a bit more time with this issue. First of all I've tried revision before 9713277607f9eac7d655c6854dd92bc2ce1b6f02 and it works. So now I'm sure that bisect result is correct. 

Also there is option named "Advanced powersaving" in BIOS of this laptop and it's currently turned on. Turning this option off fixes this issue too. I don't know exactly what does it do, so I've compared dmesg output from 2.6.25 with this option turned on and turned off. One of important changes is this line:

AMD C1E detected late.  Force timer broadcast.

So I think that it's enables/disables this "AMD C1E" feature. On 2.6.25 this option reduces power usage from battery (according to powertop).

Also I've tried to decompile and compile again DSDT from ACPI according to some googled manual. There was 0 errors and 7 warnings. All of them looks same, just different line numbers. Here is one of them:

DSDT.dsl  4825: Acquire (MUTE, 0x03E8)
Warning  1103 - ^ Possible operator timeout is ignored

I don't know, could this be cause of this problem.

Please let me know, if I can help. I'm ready to test anything :)
Comment 14 Glauber Costa 2008-07-23 13:02:13 UTC
Created attachment 16955 [details]
Enable IO-APIC after all cpus are booted

Could you please try the following patch?

It's not a fix, but just to confirm my theory.
I think that the problem is that the IO-APIC is now initialized before all cpus are put up. The dmesg that works complains about the irqchip not being connected to pin0, and the message goes away in the broken kernel.

Let's see.
Comment 15 Dmitry Nezhevenko 2008-07-23 13:26:02 UTC
I've patched 2.6.26 release. Same freeze and I should press buttons to continue boot. Please let me know, if you need dmesg from this patched kernel. I'll build modules then (currently I'm building only zImage).
Comment 16 Glauber Costa 2008-07-23 13:38:46 UTC
Yes, please.
Comment 17 Dmitry Nezhevenko 2008-07-23 14:31:32 UTC
Created attachment 16956 [details]
dmesg with applied patch "Enable IO-APIC after all cpus are booted"

Here is my dmesg. I've marked, where it freezes. Currently I'm trying to understand, why there are no first lines (like version, etc). I've CONFIG_LOG_BUF_SHIFT=17 and it should be enough.
Comment 18 Adrian Bunk 2008-07-23 14:35:10 UTC
(In reply to comment #17)
> Created an attachment (id=16956) [details]
> dmesg with applied patch "Enable IO-APIC after all cpus are booted"
> 
> Here is my dmesg. I've marked, where it freezes. Currently I'm trying to
> understand, why there are no first lines (like version, etc). I've
> CONFIG_LOG_BUF_SHIFT=17 and it should be enough.

Does "dmesg -s 1000000" give more output?
Comment 19 Dmitry Nezhevenko 2008-07-23 14:44:39 UTC
Created attachment 16957 [details]
full dmesg with applied patch "Enable IO-APIC after all cpus are booted"

Please ignore my last message. /bin/dmesg was a symlink to busybox on this tiny rootfs.
Comment 20 Glauber Costa 2008-07-28 10:38:06 UTC
Hi.

What happens if you boot with the "nolapic_timer" option ?

I think this one might be the same as another one just reported on lkml
Comment 21 Dmitry Nezhevenko 2008-07-28 11:15:57 UTC
Created attachment 17014 [details]
dmesg with nolapic_timer

Hi. With this option kernel boots as expected withod any freezes.
Comment 22 Glauber Costa 2008-07-28 11:22:42 UTC
Created attachment 17015 [details]
patch for c1e idle

C1E machines (which is your case), have a broken lapic timer. As we changed the smp bootup process a little bit, the lapic timer is now being used, and it's probably what is hurting your box. At first, the solution would be to disable it again.

However, ingo's tree has a patch from tglx that enables the lapic timer on such machines, by means of a different idle function. So disabling it would be a step backwards.

Can you please confirm that your tree _do not_ have the following cset?

commit aa276e1cafb3ce9d01d1e837bcd67e92616013ac
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Jun 9 19:15:00 2008 +0200

    x86, clockevents: add C1E aware idle function


I'm attaching the patch here. Does it apply on your tree? Can you confirm that it works as expected with the patch? (Maybe we do have to disable it in some cases, who knows...)

Thanks.
Comment 23 Dmitry Nezhevenko 2008-07-28 11:38:55 UTC
1. There is no aa276e1cafb3ce9d01d1e837bcd67e92616013ac in my tree. At least git checkout aa276e1cafb3ce9d01d1e837bcd67e92616013ac reports, that pathspec did not match any file(s).

2. Patch does NOT apply cleanly to my tree (I'm using v2.6.26 release):
patching file arch/x86/kernel/apic_32.c
patching file arch/x86/kernel/apic_64.c
patching file arch/x86/kernel/cpu/amd.c
Reversed (or previously applied) patch detected!  Assume -R? [n] n
Apply anyway? [n] n
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file arch/x86/kernel/cpu/amd.c.rej
can't find file to patch at input line 139
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/arch/x86/kernel/cpu/amd_64.c b/arch/x86/kernel/cpu/amd_64.c
|index f5fc161..f8d2058 100644
|--- a/arch/x86/kernel/cpu/amd_64.c
|+++ b/arch/x86/kernel/cpu/amd_64.c
--------------------------
File to patch:
Skip this patch? [y] y
Skipping patch.
2 out of 2 hunks ignored
patching file arch/x86/kernel/process.c
Hunk #2 succeeded at 138 with fuzz 1 (offset -82 lines).
Hunk #3 FAILED at 217.
1 out of 3 hunks FAILED -- saving rejects to file arch/x86/kernel/process.c.rej
patching file include/asm-x86/apic.h
patching file kernel/time/tick-broadcast.c

Looks like something was already applied. I've fixed one FAILED hunk by hands. It was easy enough. However I can't compile kernel:

arch/x86/kernel/process.c: In function ‘c1e_idle’:
arch/x86/kernel/process.c:175: error: ‘MSR_K8_INT_PENDING_MSG’ undeclared (first use in this function)
arch/x86/kernel/process.c:175: error: (Each undeclared identifier is reported only once
arch/x86/kernel/process.c:175: error: for each function it appears in.)
arch/x86/kernel/process.c:176: error: ‘K8_INTP_C1E_ACTIVE_MASK’ undeclared (first use in this function)
make[1]: *** [arch/x86/kernel/process.o] Error 1

Which tree should I use to test it?
Comment 24 Glauber Costa 2008-07-28 11:45:01 UTC
Are you using linus' ?

Just tell me what tree/tip-commit you're using, I'll prepare a patch that applies cleanly there.
Comment 25 Dmitry Nezhevenko 2008-07-28 11:48:45 UTC
I've cloned  git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.26.y.git (tag v2.6.26) when you asked me for git-bisect. As far as I understand, it's 2.6.26 release. If you want, I can switch to any other tree.
Comment 26 Glauber Costa 2008-07-28 12:12:40 UTC
In this case, you can switch to git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git.  It should have it.
Comment 27 Dmitry Nezhevenko 2008-07-28 13:12:16 UTC
Created attachment 17016 [details]
2.6.26 with applied c1e patch

sorry for delay. I'm currently at:
commit 8cf840bad1f38c23650da023f33331c25ebd5b58
Merge: bc24587... 0e241ff...
Author: Ingo Molnar
Date:   Mon Jul 28 18:32:31 2008 +0200

    Merge branch 'core/urgent'

I can't boot this kernel without nolapic_timer option. It just freezes and there is no way to continue boot. Here are last lines that I see:
 PCI: Using ACPI for IRQ routing
 hpet0: at MMIO 0xfed00000, IRQ 2, 8, 31
 hpet0: 3 32-bit timers, 25000000 Hz
 ACPI: RTC can wake from S4
Comment 28 Dmitry Nezhevenko 2008-07-28 13:14:22 UTC
(attached dmesg is from kernel with nolapic_timer)
Comment 29 Dmitry Nezhevenko 2008-07-28 13:54:07 UTC
Hmm. It looks like this tree contains patch aa276e1cafb3ce9d01d1e837bcd67e92616013ac.  Is it ok?
Comment 30 Zhang Rui 2008-07-28 18:04:20 UTC
re-assign to Glauber. :)
Comment 31 Dmitry Nezhevenko 2008-08-08 14:02:18 UTC
Created attachment 17147 [details]
2.6.26 x86_64 dmesg

Hi guys. I've just setup x86_64 chroot and was able to build 64bit kernel. This tree (I'm at 8cf840bad1f38c23650da023f33331c25ebd5b58) do NOT work on x86_64 too. It freezes with same last lines. And pressing buttons do not resume booting. nolapic_timer do NOT helps. It still unbootable.

Then I've tried to boot debian 2.6.26 x86_64 kernel (if you want, I can recompile vanilla 2.6.26 too). And it boots. So looks like we have two regressions with my laptop:

1. 2.6.26 release freezes during boot in 32bit mode, but works in x86_64. 

2. Both i386 and x86_64 freezes on 8cf840bad1f38c23650da023f33331c25ebd5b58.

I'm attaching dmesg from debian 2.6.26 x86_64 kernel, that works without any special kernel parameters.

From now, I'm able to test both i386 and x86_64 kernels on this laptop.
Comment 32 Zhang Rui 2008-08-17 22:41:03 UTC
dmitry,
does this problem still exist in the latest kernel release?
Comment 33 Dmitry Nezhevenko 2008-08-18 00:34:19 UTC
I'll try to reproduce it with 2.6.27-rc3 today (both x86 and x86_64 builds) and let you know. Please let me know, if I should test another release or some git tree.
Comment 34 Dmitry Nezhevenko 2008-08-18 09:27:12 UTC
Created attachment 17305 [details]
2.6.27-rc3, booted with nolapic_timer

i386: Problem is still exists with 2.6.27-rc3. I can boot this kernel only with nolapic_timer option. According to dmesg with this option, kernel correctly detects that C1E is enabled:

[    0.004000] using C1E aware idle routine
[    0.156009] System has C1E enabled

I'm attaching dmesg from this 2.6.27-rc3 kernel that was booted with "nolapic_timer". I can't get dmesg without this option

I will be able to provide x86_64 dmesg tomorrow, if you need it.
Comment 35 Kris Vek 2008-08-24 14:33:51 UTC
I experience this problem as well on my Dell Inspiron 1501, x86 running ArchLinux, kernel 2.6.26.2.  Same symptoms.  I'll attach my dmesg output, for booting with acpi, as soon as I figure out how.
Comment 36 Kris Vek 2008-08-24 14:37:01 UTC
Created attachment 17421 [details]
dmesg output after having booted WITH acpi (and being very patient, and pressing the enter button a bunch at a later point) for kernel 2.6.26.2.
Comment 37 Dmitry Nezhevenko 2008-08-24 14:39:10 UTC
Hmm.. So one more affected system. You should be able to boot kernel with nolapic_timer. 

Use "Create new attachment" link at the bottom of this page (end of attachments table)
Comment 38 Kris Vek 2008-08-24 18:20:55 UTC
To confirm, nolapic_timer also allows me to boot without further obvious issues.  I'll post my dmesg from it as well when I get the chance.
Comment 39 Glauber Costa 2008-08-25 11:22:04 UTC
Created attachment 17439 [details]
force c1e idle routine

Guys, can you please check what happens when you force the kernel to use the c1e idle routine ? 

you can do that with the patch I'm attaching to this message.
Comment 40 Dmitry Nezhevenko 2008-08-25 12:41:42 UTC
Hi.. I've tried this patch on 2.6.27-rc4 system. No difference. And I still can't boot machine with C1E at all. So I'm unable to get dmesg.

So I've added "!!! " to printk message in patch (so it will be easy to see it) and  boot_delay=100 to kernel parameters. 

There were two "!!! using C1E aware idle routine" messages during boot (I think because this machine contains two CPU cores). Last messages before freezes are still:
 hpet0: at MMIO 0xfed00000, IRQ 2, 8, 31
 hpet0: 3 32-bit timers, 25000000 Hz
 ACPI: RTC can wake from S4

I will try to add more printk messages to process.c. Maybe in all idle functions.
Comment 41 Dmitry Nezhevenko 2008-08-25 13:08:04 UTC
I think, that your idea was, that for some reason kernel use default_idle instead of c1e_idle. Currently I can confirm, that kernel freezes using c1e_idle(). I've added some printk() calls to c1e_idle.

There were a lot of invokations of c1e_idle() during boot. Some of them were just returned because of need_resched(). Here is what's going in this file just after last kernel message (ACPI: RTC can wake from S4):

1. c1e_idle were invoked twice. (I think this is because system is SMP). 
2. one invokation was returned beacuse of "if (need_resched()) return" code.
3. second invokation passed to "if (c1e_detected) {" block
4. block "if (!cpu_isset(cpu, c1e_mask)) {" was not called.
5. clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu) was called
6. system freeze

printk message just after clockevents_notify() was not called. 

I will attach my "debug" patch and photo with last messages on the screen.
Comment 42 Dmitry Nezhevenko 2008-08-25 13:09:44 UTC
Created attachment 17443 [details]
Verbose debug messages for arch/x86/kernel/process.c
Comment 43 Dmitry Nezhevenko 2008-08-25 13:13:50 UTC
Created attachment 17444 [details]
Screenshot of last messages with applied force-c1idle.patch and debug info patch

Hope, that this information will be useful. I will try to take a look to clockevents_notify() however I think, that I have no enough kernel hacking skills to fix it
Comment 44 Glauber Costa 2008-08-25 13:57:53 UTC
Ok, before digging into it, I noticed we have no failing dmesg with apic debug messages turned on. I'm thinking that maybe we're broadcasting too fast , and this might be the problem.

apic=debug command line switch should give us more information.

Can you provide two dmesgs, with and without nolapic_timer?

I understand you can't boot to completion, so you might want to rollback a few commits up to the point you could , just for this.
Comment 45 Dmitry Nezhevenko 2008-08-25 14:04:50 UTC
I've added a bit more printk to some files. And traced it to kernel/notifier.c. 
System freezes somewhere in ret = nb->notifier_call(nb, val, v);

I can't debug it anymore since I don't know function name that is referenced by "nb" and I don't know any way, how to get any useful information from struct notifier_block. As far sa I understand, some code/module subscribes to notifications about clockevents and freezes for some reason. 
Comment 46 Dmitry Nezhevenko 2008-08-25 14:40:41 UTC
Created attachment 17446 [details]
2.6.26 dmesg with apic=debug
Comment 47 Dmitry Nezhevenko 2008-08-25 14:42:04 UTC
Created attachment 17447 [details]
2.6.26 dmesg with apic=debug nolapic_timer

I've used 2.6.26.3 release since I know that it's bootable. If you need, I can bisect, which revision make system unbootable without nolapic_timer.
Comment 48 Glauber Costa 2008-08-25 14:54:25 UTC
No. It should use a broken kernel, but one that you can boot by pressing enter.
The first one after the commit you bisected to is fine.
Comment 49 Dmitry Nezhevenko 2008-08-26 12:03:01 UTC
Created attachment 17464 [details]
17446: bisected dmesg with apic=debug
Comment 50 Dmitry Nezhevenko 2008-08-26 12:06:54 UTC
Created attachment 17465 [details]
bisected dmesg with apic=debug nolapic_timer

Hi.. Previous dmesgs from 2.6.26.2 are "broken" too, since I pressed key a few times. I don't know, why timestamp information in printk() message was smooth. Maybe some commit after bisected one changes this behaviour. 

Latest two attachments are from kernel at bisected revision. There is ~30 seconds delay.
Comment 51 Dmitry Nezhevenko 2008-09-03 11:31:17 UTC
Just to link it. There is some progress with this (or almost this) issue in LKML in thread named "2.6.27-rc5 doesn't boot on a Pavilion laptop"

http://lkml.org/lkml/2008/9/3/228
http://lkml.org/lkml/2008/9/3/296
Comment 52 ykzhao 2008-09-08 01:56:55 UTC
Hi, Dmitry
    From the acpidump it seems that there is no _CST object and only C1 is supported on your system.
      >  C2 Latency : 0065 // greater than 100, C2 is not supported.
      >  C3 Latency : 03E9 // greater than 1000, C3 is not supported.
    
    Although only C1 is supported by the CPU, the system still can't work well. Maybe it is related with local apic timer.  The remaing issue is that the 2.6.26 kernel can't work while 2.6.25 can work.
 
    Will you please try the latest kernel(2.6.27-rc5) and see whether the problem still exists?
    thanks.
Comment 53 Dmitry Nezhevenko 2008-09-08 02:52:20 UTC
Hi.. 2.6.27-rc5 is still affected and problem exists.

There were a solution in LKML by Thomas Gleixner:
http://marc.info/?l=linux-kernel&m=122047790223527&w=4

I've added my "Tested-by: " here. After applying this patchset on top of mainline 2.6.27-rc5 kernel works with C1E as expected.

I'm not sure, were these patches accepted to mainline. 
Comment 54 Dmitry Nezhevenko 2008-09-15 14:11:27 UTC
Mainline kernel v2.6.27-rc6 is NOT affected and works as expected. So looks like regression is fixed. Closing this report.