Bug 16145
| Summary: | Unable to boot unless "notsc" or "clocksource=hpet", or acpi_pad disabling the TSC | | |
|---|---|---|---|
| Product: | Timers | Reporter: | Tom Gundersen (teg) |
| Component: | Other | Assignee: | john stultz (john.stultz) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | blocking | CC: | acpi-bugzilla, lenb, maciej.rutecki, rjw, rui.zhang, suresh.b.siddha, venki |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.35-rc[1,2] | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 56331 | | |
| Attachments: | dmesg of successful boot, output of 'lspci -v', bisect log, cpuinfo, .config | | |
Description
Tom Gundersen
2010-06-07 13:11:39 UTC
Created attachment 26682 [details]
dmesg of successful boot
Created attachment 26683 [details]
output of 'lspci -v'
Mysterious. On successful boot, TSC is being marked unstable by ACPI C-state code. Not sure how the acpi_pad change is making any difference here. Did you narrow it down to the acpi_pad change by doing git bisect?

What happens if you, on rc2, disable ACPI_PROCESSOR_AGGREGATOR in your .config? Do you still see this fault?

Yes, I bisected between 2.6.34 and 2.6.35-rc1. I'll attach the bisect log in case that is of any help. I am recompiling rc2 now. I'll let you know how it goes.

Created attachment 26686 [details]
bisect log
Recompiled rc2 with ACPI_PROCESSOR_AGGREGATOR disabled. The error persists, but the message is slightly different (no backtrace):

INIT: version 2.86 booting
init[1]: segfault at ffffffff810111c9 ip ffffffff810111c9 sp 00007fff8bba30d8 error 15
init[1]: segfault at ffffffff810111c9 ip ffffffff810111c9 sp 00007fff8bba30d8 error 15
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.35-rc2-TEG #94

Anything else I should try?

OK. I expect the system to fail even with the patch reverted and PROCESSOR_AGGREGATOR not configured. Might be good to check that case.

It appears that the problem is not directly related to commit 0dc698b9. The system will not boot with or without this change when ACPI_PROCESSOR_AGGREGATOR is disabled. There seems to be some TSC-related weirdness on this platform that is causing this problem. Somehow the mark_tsc_unstable() in acpi_pad.c is coming to the rescue of this system and letting it boot. But once that call is taken out (either through commit 0dc698b9 or by removing PROCESSOR_AGGREGATOR in .config), the system will fail to boot. What is surprising to me is that there is another mark_tsc_unstable() in the acpi idle driver that doesn't seem to be helping. Let me look at this a bit more. Copying Len as well.

I can verify your expectations :-) My system will only boot when both commit 0dc698b9 is reverted and ACPI_PROCESSOR_AGGREGATOR is disabled.

You mean ACPI_PROCESSOR_AGGREGATOR is enabled?

Sorry. Yes, that's what I meant: My system will only boot when both commit 0dc698b9 is reverted and ACPI_PROCESSOR_AGGREGATOR is enabled.

Handled-By : Venkatesh Pallipadi <venki@google.com>

Just a note in case someone else runs into this problem: A temporary fix (instead of reverting the patch in question) is to manually disable the TSC by forcing a different clocksource, i.e. passing "clocksource=hpet" to the kernel.

acpi_pad is new with 2.6.32. Did any Linux kernel before 2.6.32 successfully boot?
If CONFIG_ACPI_PROCESSOR=m is being used, try CONFIG_ACPI_PROCESSOR=y.
Please include here the contents of /proc/cpuinfo.

Oh, and please confirm that booting with "notsc" is a sufficient workaround.

Kernels before 2.6.32: I have been booting without problems since at least 2008, and 2.6.32 was late 2009. I can try 2.6.31 tomorrow to be certain. I'm using CONFIG_ACPI_PROCESSOR=y.

Created attachment 26704 [details]
cpuinfo
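For anyone trying the "clocksource=hpet" or "notsc" workarounds mentioned above, here is a minimal Python sketch for checking which clocksource the running kernel actually selected. It assumes the usual sysfs clocksource directory is present; the function name and the `base` parameter are made up for illustration and are not part of any kernel tool.

```python
from pathlib import Path

# Usual sysfs location on Linux; kept as a parameter so it can be overridden.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources, or (None, []) if unreadable."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        # sysfs not mounted, or the kernel does not expose these files
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current:", current)
    print("available:", " ".join(available))
```

After booting with "clocksource=hpet", the current value would be expected to read `hpet`; with "notsc", `tsc` should not be the selected clocksource.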
Yes, "notsc" works as well.

Maybe there are no C-states exported in ACPI and TSC is only getting disabled by pad. What does
# grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*
look like after a successful boot with this revert? I don't seem to find any TSC-related errata for this CPU, and the failure signature seems very strange too.

Very odd. There might be something else wrong with my system: There is no "cpuidle" directory in "/sys/devices/system/cpu/cpu*". I cannot recall if it has always been this way, or if it has changed recently. I have attached my config, in case you can point to some option I should try changing. (As mentioned earlier, I'm using this same kernel on a ThinkPad X60 and there C-states are shown in powertop).

Created attachment 26707 [details]
.config
No C-states is OK on this platform. Looking at your dmesg again, I see mwait substates as 0x20, which means only the C1 state is supported on this platform. So, basically, TSC is not going to be disabled on this platform by the C-state driver, as TSC is "supposed" to run fine with C1 and will only be broken with C2 or deeper. I was wrong in comment #3 when I said the C-state driver should be disabling TSC anyway. So, acpi_pad is the only driver disabling TSC and that's helping this platform survive. It will be interesting to see pre-2.6.32 behaviour, as Len mentioned in comment #16.

Just checked pre-2.6.32: Either my recollection is completely wrong, or I have somehow changed my config since that time, because it was as Len predicted. 5e5027bd26ed4df735d29e66cd5c1c9b5959a587 (the merge of acpi_pad) boots fine, but its parent does not (with the same error as after 0dc698b9). Let me know if there are other things I should try.

Did you happen to change BIOS on this box recently?

I have upgraded to the newest BIOS version in the hopes of fixing my very slow boot (it only helped a bit). This is my current BIOS version:

****
P5K BIOS 1201
1. Support new CPUs. Please refer to our website at: http://support.asus.com/cpusupport/cpusupport.aspx
2. Fix the system sometimes takes about 30 seconds to get into the OS when many USB devices are plugged into the system.
****

I'm afraid I cannot remember what kernel version I was on at the time of the upgrade, so I don't know if there was a regression in the BIOS.

Thanks for clarifying that there exists no kernel that boots on this box w/o "notsc" or "clocksource=hpet" unless it is 2.6.32 or newer with acpi_pad disabling the TSC. I wonder how it ran in 2008. Can you try the distro release disk that you first installed?

I'm confused now. I found an old Arch Linux (which is what I have always been using) 32 bit (I usually use 64 bit) install CD with 2.6.30-ARCH. It booted fine. I then compiled 2.6.30 manually, and it did not boot.
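The reading of the mwait substates value 0x20 earlier in this thread can be sketched as a small decoder. This is an illustration only: it assumes the value printed in dmesg is the EDX register of CPUID leaf 5, laid out as one 4-bit sub-state count per C-state (bits 3:0 for C0, bits 7:4 for C1, and so on). The function names are hypothetical.

```python
def decode_mwait_substates(edx):
    """Split an MWAIT substates word into per-C-state sub-state counts."""
    # Assumption: 4 bits per C-state, C0 in the lowest nibble.
    return {f"C{i}": (edx >> (4 * i)) & 0xF for i in range(8)}


def deepest_mwait_cstate(edx):
    """Return the index of the deepest C-state with any sub-state, or None."""
    supported = [i for i in range(8) if (edx >> (4 * i)) & 0xF]
    return max(supported) if supported else None


if __name__ == "__main__":
    print(decode_mwait_substates(0x20))
    print(deepest_mwait_cstate(0x20))
```

Under that assumption, 0x20 has a non-zero count only in the C1 field, which matches the conclusion that nothing deeper than C1 is advertised on this machine.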
I should have tried a 64bit install CD, but I first have to get hold of some blank CDs. I never understood how the Arch kernels are generated, so I don't know how to combine my own .config with the standard Arch stuff to debug further.

If you can get the .config for the Arch kernel, either via /boot/config* or via /proc/config*, then you can drop that into a source tree and build. That may get you to a place where you've got a config in hand and a source tree in hand that works, and then you just need to figure out how to break it :-)

Handled-By : Len Brown <lenb@kernel.org>

Tom, any update on this?

Hi, I tried following Len's advice and compiled with a standard Arch config, but my first attempt failed (I probably didn't compile in the right modules) and I have not had time to pursue it further (each compile takes 20 minutes). I'm quite busy with work at the moment, so I doubt I will manage to spend the required time to bisect the differences between the (probably working) Arch config and my (broken) config.

To sum up: 2.6.30 works with Arch config and not with mine. All later kernels until 2.6.35-rc1 work with my config. 2.6.35-rc1 does not work with my config. I am not able to verify that Arch's config works with 2.6.35-rc1.

I'm running rc3 at the moment and the problem is still present. I will keep testing the newer rc's and if things change I will let you know. If you have any patches you want me to test, or specific configs to change, I'd of course be happy to.

Apparently, this is a .config problem, not a regression, so dropping from the list of recent regressions.

Hi, I revisited this bug with 2.6.36-rc5, and it seems to be solved. dmesg says "Switching to clocksource tsc" and my system works fine. I don't know what caused the bug to be fixed though...

Cheers,
Tom
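Len's suggestion above, to start from the working Arch config and then find what breaks it, comes down to comparing two .config files option by option. A minimal Python sketch of that comparison follows. It assumes the usual .config line forms (`CONFIG_X=value` and `# CONFIG_X is not set`) and treats an option missing from one file as unset; the function names are hypothetical.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option name -> value; options marked 'is not set' become 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(working, broken):
    """Return {option: (working_value, broken_value)} for options that differ."""
    a, b = parse_config(working), parse_config(broken)
    return {
        name: (a.get(name, "n"), b.get(name, "n"))
        for name in sorted(set(a) | set(b))
        if a.get(name, "n") != b.get(name, "n")
    }
```

The resulting list of differing options could then be halved repeatedly, rebuilding each time, to narrow down which option decides whether the machine boots.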