Bug 11101
Description
Dmitry Nezhevenko
2008-07-16 09:33:12 UTC
Created attachment 16842 [details]
2.6.26 dmesg
Created attachment 16843 [details]
2.6.26 dmesg with acpi=off (works)
Created attachment 16844 [details]
lspci && lsmod
Created attachment 16845 [details]
2.6.25.6 dmesg (that works great on this machine)
PS. I've assigned this bug to ACPI, since there is no such problem with acpi=off. Sorry if this incorrect Hi, Dmitry Will you please try git-bisect to identify which commit causes this regression? Will you please attach the output of acpidump? thanks. Created attachment 16857 [details]
acpidump output
Attaching acpidump output
I'll try to use git-bisect later today or tomorrow. Finally I've done git-bisect. 9713277607f9eac7d655c6854dd92bc2ce1b6f02 is first bad commit commit 9713277607f9eac7d655c6854dd92bc2ce1b6f02 Author: Glauber de Oliveira Costa <gcosta@redhat.com> Date: Wed Mar 19 14:25:43 2008 -0300 x86: boot cpus from cpu_up, instead of prepare_cpus After all the infrastructure work, we're now prepared to boot the cpus from cpu_up, and not from prepare_cpus. So the difference between cold boot and hotplug is effectively over, and the functions are used to the purposes they're meant to. Signed-off-by: Glauber Costa <gcosta@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> :040000 040000 c191cf15825af14a8f4e2842105c33e19ad9df1d 3a6e4c6a5489a20722e83c98204f16d29bdc01fd M arch Please let me know, if you need any other info about this laptop. add Glauber and Ingo to the cc list. :) as this is a regression, cc Rafael. Sorry for the delay in answering this, I was travelling and just returned to life today. I've seen this problem before, but got no success in fixing it, since I can't reproduce. This seems to happen only in x86_64 machines (this is your case, right?), and the furthest I came was to the conclusion that it was something related to udelay waits that were probably taking too long. I'll give this a further thought. It's Turion64 X2 based laptop with nvidia chipset. Machine supports x86_64 however I'm running 32-bit kernel and distro. dmesg and all other attached files are from 32 bit environment. Hi again. I've spent a bit more time with this issue. First of all I've tried revision before 9713277607f9eac7d655c6854dd92bc2ce1b6f02 and it works. So now I'm sure that bisect result is correct. Also there is option named "Advanced powersaving" in BIOS of this laptop and it's currently turned on. Turning this option off fixes this issue too. I don't know exactly what does it do, so I've compared dmesg output from 2.6.25 with this option turned on and turned off. One of important changes is this line: AMD C1E detected late. Force timer broadcast. So I think that it's enables/disables this "AMD C1E" feature. On 2.6.25 this option reduces power usage from battery (according to powertop). Also I've tried to decompile and compile again DSDT from ACPI according to some googled manual. There was 0 errors and 7 warnings. All of them looks same, just different line numbers. Here is one of them: DSDT.dsl 4825: Acquire (MUTE, 0x03E8) Warning 1103 - ^ Possible operator timeout is ignored I don't know, could this be cause of this problem. Please let me know, if I can help. I'm ready to test anything :) Created attachment 16955 [details]
Enable IO-APIC after all cpus are booted
Could you please try the following patch?
It's not a fix, but just to confirm my theory.
I think that the problem is that the IO-APIC is now initialized before all cpus are put up. The dmesg that works complains about the irqchip not being connected to pin0, and the message goes away in the broken kernel.
Let's see.
I've patched 2.6.26 release. Same freeze and I should press buttons to continue boot. Please let me know, if you need dmesg from this patched kernel. I'll build modules then (currently I'm building only zImage). Yes, please. Created attachment 16956 [details]
dmesg with applied patch "Enable IO-APIC after all cpus are booted"
Here is my dmesg. I've marked, where it freezes. Currently I'm trying to understand, why there are no first lines (like version, etc). I've CONFIG_LOG_BUF_SHIFT=17 and it should be enough.
(In reply to comment #17) > Created an attachment (id=16956) [details] > dmesg with applied patch "Enable IO-APIC after all cpus are booted" > > Here is my dmesg. I've marked, where it freezes. Currently I'm trying to > understand, why there are no first lines (like version, etc). I've > CONFIG_LOG_BUF_SHIFT=17 and it should be enough. Does "dmesg -s 1000000" give more output? Created attachment 16957 [details]
full dmesg with applied patch "Enable IO-APIC after all cpus are booted"
Please ignore my last message. /bin/dmesg was a symlink to busybox on this tiny rootfs.
Hi. What happens if you boot with the "nolapic_timer" option ? I think this one might be the same as another one just reported on lkml Created attachment 17014 [details]
dmesg with nolapic_timer
Hi. With this option kernel boots as expected withod any freezes.
Created attachment 17015 [details] patch for c1e idle C1E machines (which is your case), have a broken lapic timer. As we changed the smp bootup process a little bit, the lapic timer is now being used, and it's probably what is hurting your box. At first, the solution would be to disable it again. However, ingo's tree has a patch from tglx that enables the lapic timer on such machines, by means of a different idle function. So disabling it would be a step backwards. Can you please confirm that your tree _do not_ have the following cset? commit aa276e1cafb3ce9d01d1e837bcd67e92616013ac Author: Thomas Gleixner <tglx@linutronix.de> Date: Mon Jun 9 19:15:00 2008 +0200 x86, clockevents: add C1E aware idle function I'm attaching the patch here. Does it apply on your tree? Can you confirm that it works as expected with the patch? (Maybe we do have to disable it in some cases, who knows...) Thanks. 1. There is no aa276e1cafb3ce9d01d1e837bcd67e92616013ac in my tree. At least git checkout aa276e1cafb3ce9d01d1e837bcd67e92616013ac reports, that pathspec did not match any file(s). 2. Patch does NOT apply cleanly to my tree (I'm using v2.6.26 release): patching file arch/x86/kernel/apic_32.c patching file arch/x86/kernel/apic_64.c patching file arch/x86/kernel/cpu/amd.c Reversed (or previously applied) patch detected! Assume -R? [n] n Apply anyway? [n] n Skipping patch. 2 out of 2 hunks ignored -- saving rejects to file arch/x86/kernel/cpu/amd.c.rej can't find file to patch at input line 139 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -------------------------- |diff --git a/arch/x86/kernel/cpu/amd_64.c b/arch/x86/kernel/cpu/amd_64.c |index f5fc161..f8d2058 100644 |--- a/arch/x86/kernel/cpu/amd_64.c |+++ b/arch/x86/kernel/cpu/amd_64.c -------------------------- File to patch: Skip this patch? [y] y Skipping patch. 2 out of 2 hunks ignored patching file arch/x86/kernel/process.c Hunk #2 succeeded at 138 with fuzz 1 (offset -82 lines). Hunk #3 FAILED at 217. 1 out of 3 hunks FAILED -- saving rejects to file arch/x86/kernel/process.c.rej patching file include/asm-x86/apic.h patching file kernel/time/tick-broadcast.c Looks like something was already applied. I've fixed one FAILED hunk by hands. It was easy enough. However I can't compile kernel: arch/x86/kernel/process.c: In function ‘c1e_idle’: arch/x86/kernel/process.c:175: error: ‘MSR_K8_INT_PENDING_MSG’ undeclared (first use in this function) arch/x86/kernel/process.c:175: error: (Each undeclared identifier is reported only once arch/x86/kernel/process.c:175: error: for each function it appears in.) arch/x86/kernel/process.c:176: error: ‘K8_INTP_C1E_ACTIVE_MASK’ undeclared (first use in this function) make[1]: *** [arch/x86/kernel/process.o] Error 1 Which tree should I use to test it? Are you using linus' ? Just tell me what tree/tip-commit you're using, I'll prepare a patch that applies cleanly there. I've cloned git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.26.y.git (tag v2.6.26) when you asked me for git-bisect. As far as I understand, it's 2.6.26 release. If you want, I can switch to any other tree. In this case, you can switch to git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git. It should have it. Created attachment 17016 [details]
2.6.26 with applied c1e patch
sorry for delay. I'm currently at:
commit 8cf840bad1f38c23650da023f33331c25ebd5b58
Merge: bc24587... 0e241ff...
Author: Ingo Molnar
Date: Mon Jul 28 18:32:31 2008 +0200
Merge branch 'core/urgent'
I can't boot this kernel without nolapic_timer option. It just freezes and there is no way to continue boot. Here are last lines that I see:
PCI: Using ACPI for IRQ routing
hpet0: at MMIO 0xfed00000, IRQ 2, 8, 31
hpet0: 3 32-bit timers, 25000000 Hz
ACPI: RTC can wake from S4
(attached dmesg is from kernel with nolapic_timer) Hmm. It looks like this tree contains patch aa276e1cafb3ce9d01d1e837bcd67e92616013ac. Is it ok? re-assign to Glauber. :) Created attachment 17147 [details]
2.6.26 x86_64 dmesg
Hi guys. I've just setup x86_64 chroot and was able to build 64bit kernel. This tree (I'm at 8cf840bad1f38c23650da023f33331c25ebd5b58) do NOT work on x86_64 too. It freezes with same last lines. And pressing buttons do not resume booting. nolapic_timer do NOT helps. It still unbootable.
Then I've tried to boot debian 2.6.26 x86_64 kernel (if you want, I can recompile vanilla 2.6.26 too). And it boots. So looks like we have two regressions with my laptop:
1. 2.6.26 release freezes during boot in 32bit mode, but works in x86_64.
2. Both i386 and x86_64 freezes on 8cf840bad1f38c23650da023f33331c25ebd5b58.
I'm attaching dmesg from debian 2.6.26 x86_64 kernel, that works without any special kernel parameters.
From now, I'm able to test both i386 and x86_64 kernels on this laptop.
dmitry, does this problem still exist in the latest kernel release? I'll try to reproduce it with 2.6.27-rc3 today (both x86 and x86_64 builds) and let you know. Please let me know, if I should test another release or some git tree. Created attachment 17305 [details]
2.6.27-rc3, booted with nolapic_timer
i386: Problem is still exists with 2.6.27-rc3. I can boot this kernel only with nolapic_timer option. According to dmesg with this option, kernel correctly detects that C1E is enabled:
[ 0.004000] using C1E aware idle routine
[ 0.156009] System has C1E enabled
I'm attaching dmesg from this 2.6.27-rc3 kernel that was booted with "nolapic_timer". I can't get dmesg without this option
I will be able to provide x86_64 dmesg tomorrow, if you need it.
I experience this problem as well on my Dell Inspiron 1501, x86 running ArchLinux, kernel 2.6.26.2. Same symptoms. I'll attach my dmesg output, for booting with acpi, as soon as I figure out how. Created attachment 17421 [details]
dmesg output after having booted WITH acpi (and being very patient, and pressing the enter button a bunch at a later point) for kernel 2.6.26.2.
Hmm.. So one more affected system. You should be able to boot kernel with nolapic_timer. Use "Create new attachment" link at the bottom of this page (end of attachments table) To confirm, nolapic_timer also allows me to boot without further obvious issues. I'll post my dmesg from it as well when I get the chance. Created attachment 17439 [details]
force c1e idle routine
Guys, can you please check what happens when you force the kernel to use the c1e idle routine ?
you can do that with the patch I'm attaching to this message.
Hi.. I've tried this patch on 2.6.27-rc4 system. No difference. And I still can't boot machine with C1E at all. So I'm unable to get dmesg. So I've added "!!! " to printk message in patch (so it will be easy to see it) and boot_delay=100 to kernel parameters. There were two "!!! using C1E aware idle routine" messages during boot (I think because this machine contains two CPU cores). Last messages before freezes are still: hpet0: at MMIO 0xfed00000, IRQ 2, 8, 31 hpet0: 3 32-bit timers, 25000000 Hz ACPI: RTC can wake from S4 I will try to add more printk messages to process.c. Maybe in all idle functions. I think, that your idea was, that for some reason kernel use default_idle instead of c1e_idle. Currently I can confirm, that kernel freezes using c1e_idle(). I've added some printk() calls to c1e_idle. There were a lot of invokations of c1e_idle() during boot. Some of them were just returned because of need_resched(). Here is what's going in this file just after last kernel message (ACPI: RTC can wake from S4): 1. c1e_idle were invoked twice. (I think this is because system is SMP). 2. one invokation was returned beacuse of "if (need_resched()) return" code. 3. second invokation passed to "if (c1e_detected) {" block 4. block "if (!cpu_isset(cpu, c1e_mask)) {" was not called. 5. clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu) was called 6. system freeze printk message just after clockevents_notify() was not called. I will attach my "debug" patch and photo with last messages on the screen. Created attachment 17443 [details]
Verbose debug messages for arch/x86/kernel/process.c
Created attachment 17444 [details]
Screenshot of last messages with applied force-c1idle.patch and debug info patch
Hope, that this information will be useful. I will try to take a look to clockevents_notify() however I think, that I have no enough kernel hacking skills to fix it
Ok, before digging into it, I noticed we have no failing dmesg with apic debug messages turned on. I'm thinking that maybe we're broadcasting too fast , and this might be the problem. apic=debug command line switch should give us more information. Can you provide two dmesgs, with and without nolapic_timer? I understand you can't boot to completion, so you might want to rollback a few commits up to the point you could , just for this. I've added a bit more printk to some files. And traced it to kernel/notifier.c. System freezes somewhere in ret = nb->notifier_call(nb, val, v); I can't debug it anymore since I don't know function name that is referenced by "nb" and I don't know any way, how to get any useful information from struct notifier_block. As far sa I understand, some code/module subscribes to notifications about clockevents and freezes for some reason. Created attachment 17446 [details]
2.6.26 dmesg with apic=debug
Created attachment 17447 [details]
2.6.26 dmesg with apic=debug nolapic_timer
I've used 2.6.26.3 release since I know that it's bootable. If you need, I can bisect, which revision make system unbootable without nolapic_timer.
No. It should use a broken kernel, but one that you can boot by pressing enter. The first one after the commit you bisected to is fine. Created attachment 17464 [details]
17446: bisected dmesg with apic=debug
Created attachment 17465 [details]
bisected dmesg with apic=debug nolapic_timer
Hi.. Previous dmesgs from 2.6.26.2 are "broken" too, since I pressed key a few times. I don't know, why timestamp information in printk() message was smooth. Maybe some commit after bisected one changes this behaviour.
Latest two attachments are from kernel at bisected revision. There is ~30 seconds delay.
Just to link it. There is some progress with this (or almost this) issue in LKML in thread named "2.6.27-rc5 doesn't boot on a Pavilion laptop" http://lkml.org/lkml/2008/9/3/228 http://lkml.org/lkml/2008/9/3/296 Hi, Dmitry From the acpidump it seems that there is no _CST object and only C1 is supported on your system. > C2 Latency : 0065 // greater than 100, C2 is not supported. > C3 Latency : 03E9 // greater than 1000, C3 is not supported. Although only C1 is supported by the CPU, the system still can't work well. Maybe it is related with local apic timer. The remaing issue is that the 2.6.26 kernel can't work while 2.6.25 can work. Will you please try the latest kernel(2.6.27-rc5) and see whether the problem still exists? thanks. Hi.. 2.6.27-rc5 is still affected and problem exists. There were a solution in LKML by Thomas Gleixner: http://marc.info/?l=linux-kernel&m=122047790223527&w=4 I've added my "Tested-by: " here. After applying this patchset on top of mainline 2.6.27-rc5 kernel works with C1E as expected. I'm not sure, were these patches accepted to mainline. Mainline kernel v2.6.27-rc6 is NOT affected and works as expected. So looks like regression is fixed. Closing this report. |