Bug 16145 - Unable to boot unless "notsc" or "clocksource=hpet", or acpi_pad disabling the TSC
Status: CLOSED CODE_FIX
Alias: None
Product: Timers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 blocking
Assignee: john stultz
URL:
Keywords:
Depends on:
Blocks: 56331
Reported: 2010-06-07 13:11 UTC by Tom Gundersen
Modified: 2013-04-09 06:23 UTC (History)

See Also:
Kernel Version: 2.6.35-rc[1,2]
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg of successful boot (50.45 KB, text/plain)
2010-06-07 13:13 UTC, Tom Gundersen
output of 'lspci -v' (7.71 KB, text/plain)
2010-06-07 13:13 UTC, Tom Gundersen
bisect log (2.18 KB, text/plain)
2010-06-07 17:08 UTC, Tom Gundersen
cpuinfo (2.92 KB, application/octet-stream)
2010-06-09 22:46 UTC, Tom Gundersen
.config (64.81 KB, text/plain)
2010-06-10 01:04 UTC, Tom Gundersen

Description Tom Gundersen 2010-06-07 13:11:39 UTC
It looks like commit 0dc698b93f3eecdda43b22232131324eb41e510c
causes init to segfault at boot time on one of my computers.

Reverting 0dc698b9 on top of 2.6.35-rc2 solves the problem.

FWIW, the problem does not occur on my laptop (ThinkPad X60) with (mostly) the same software and the same kernel.

The output on failed boot (2.6.35-rc2) is:

INIT: version 2.86 booting
init[1]: segfault at ffffffff8100896d ip ffffffff8100896d sp 00007fff734b04c8 error 15
init[1]: segfault at ffffffff8100896d ip ffffffff8100896d sp 00007fff734b04c8 error 15
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.35-rc2-TEG #93
Call Trace:
  [ffffffff814d1a74] panic+0x86/0xf4
  [ffffffff8108d5b4] ? perf_event_exit_task+0x27/0x112
  [ffffffff8103d118] do_exit+0x6d/0x687
  [ffffffff8103d9a5] do_group_exit+0x70/0x98
  [ffffffff8104a1e4] get_signal_to_deliver+0x330/0x34e
  [ffffffff810014da] do_signal+0x6d/0x65b
  [ffffffff814d1b1e] ? printk+0x3c/0x3e
  [ffffffff8100896d] ? rdtsc_barrier+0x0/0xc
  [ffffffff814d1b1e] ? printk+0x3c/0x3e
  [ffffffff81020811] ? __bad_area_nosemaphore+0x179/0x1a3
  [ffffffff8100896d] ? rdtsc_barrier+0x0/0xc
  [ffffffff81001b03] do_notify_resume+0x27/0x47
  [ffffffff814d3a9a] retint_signal+0x3d/0x83
  [ffffffff8100896d] ? rdtsc_barrier+0x0/0xc
[drm:drm_fb_helper_panic] *ERROR* panic occurred, switching back to text console
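
For readers decoding the "error 15" field in the log above: the kernel prints the x86 page-fault error code in hexadecimal, so the value is 0x15. The following is a minimal sketch of how it can be decoded, assuming the standard x86 page-fault error code bit layout. The helper name decode_pf_error is made up for illustration and is not part of any kernel tool.

```shell
#!/bin/sh
# decode_pf_error: hypothetical helper (illustration only).
# Argument: the hex value from a "segfault at ... error N" line, without "0x".
decode_pf_error() {
    code=$((0x$1))
    # bit 0: 1 = protection violation, 0 = page not present
    if [ $((code & 1)) -ne 0 ]; then
        out="protection-violation"
    else
        out="not-present"
    fi
    # bit 1: the access was a write
    if [ $((code & 2)) -ne 0 ]; then out="$out write"; fi
    # bit 2: the fault was raised while running in user mode
    if [ $((code & 4)) -ne 0 ]; then out="$out user"; fi
    # bit 3: a reserved bit was set in a page-table entry
    if [ $((code & 8)) -ne 0 ]; then out="$out reserved-bit"; fi
    # bit 4: the access was an instruction fetch
    if [ $((code & 16)) -ne 0 ]; then out="$out instruction-fetch"; fi
    echo "$out"
}

decode_pf_error 15
```

Read this way, the code says that user-mode init tried to execute an instruction at a kernel address, which is consistent with the fault address and ip being identical in the log.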
Comment 1 Tom Gundersen 2010-06-07 13:13:21 UTC
Created attachment 26682 [details]
dmesg of successful boot
Comment 2 Tom Gundersen 2010-06-07 13:13:55 UTC
Created attachment 26683 [details]
output of 'lspci -v'
Comment 3 Venkatesh Pallipadi 2010-06-07 16:58:12 UTC
Mysterious.

On successful boot, TSC is being marked unstable by ACPI C-state code. Not sure how the acpi pad change is making any difference here.

Did you narrow down on ACPI pad change by doing git bisect?
What happens if you, on rc2, disable ACPI_PROCESSOR_AGGREGATOR in your .config? Do you still see this fault?
Comment 4 Tom Gundersen 2010-06-07 17:08:05 UTC
Yes, I bisected between 2.6.34 and 2.6.35-rc1. I'll attach the bisect log in case that is of any help.

I am recompiling rc2 now. I'll let you know how it goes.
Comment 5 Tom Gundersen 2010-06-07 17:08:38 UTC
Created attachment 26686 [details]
bisect log
Comment 6 Tom Gundersen 2010-06-07 18:03:16 UTC
Recompiled rc2 with ACPI_PROCESSOR_AGGREGATOR disabled. The error persists, but the message is slightly different (no backtrace):

INIT: version 2.86 booting
init[1]: segfault at ffffffff810111c9 ip ffffffff810111c9 sp 00007fff8bba30d8 error 15
init[1]: segfault at ffffffff810111c9 ip ffffffff810111c9 sp 00007fff8bba30d8 error 15
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.35-rc2-TEG #94




Anything else I should try?
Comment 7 Venkatesh Pallipadi 2010-06-07 18:24:28 UTC
OK. I expect the system to fail even with the patch reverted and PROCESSOR_AGGREGATOR not configured. It might be good to check that case.

Appears that the problem is not directly related to commit 0dc698b9. The system will not boot with or without this change when ACPI_PROCESSOR_AGGREGATOR is disabled.

There seems to be some TSC-related weirdness on this platform that is causing this problem. Somehow the mark_tsc_unstable() in acpi_pad.c is coming to the rescue of this system and letting it boot. But once that call is taken out (either through commit 0dc698b9 or by removing PROCESSOR_AGGREGATOR in .config), the system will fail to boot. What is surprising to me is that there is another mark_tsc_unstable() in the acpi idle driver that doesn't seem to be helping. Let me look at this a bit more. Copying Len as well.
Comment 8 Tom Gundersen 2010-06-07 18:49:51 UTC
I can verify your expectations :-)

My system will only boot when both commit 0dc698b9 is reverted and ACPI_PROCESSOR_AGGREGATOR is disabled.
Comment 9 Venkatesh Pallipadi 2010-06-07 19:36:34 UTC
you mean ACPI_PROCESSOR_AGGREGATOR is enabled?
Comment 10 Tom Gundersen 2010-06-07 19:51:44 UTC
Sorry. Yes, that's what I meant:

My system will only boot when both commit 0dc698b9 is reverted and
ACPI_PROCESSOR_AGGREGATOR is enabled.
Comment 11 Rafael J. Wysocki 2010-06-07 21:02:29 UTC
Handled-By : Venkatesh Pallipadi <venki@google.com>
Comment 12 Tom Gundersen 2010-06-09 02:06:19 UTC
Just a note in case someone else runs into this problem:

A temporary fix (instead of reverting the patch in question) is to manually disable the TSC by forcing a different clocksource, i.e. passing "clocksource=hpet" to the kernel.
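
After booting with that parameter, the selected clocksource can be checked through sysfs. This is a minimal sketch, assuming the usual sysfs location for clocksource controls; the helper name show_clocksource and the CS_DIR variable are made up for illustration.

```shell
#!/bin/sh
# Usual sysfs location of the clocksource controls (assumed to be present).
CS_DIR=${CS_DIR:-/sys/devices/system/clocksource/clocksource0}

# show_clocksource: hypothetical helper (illustration only).
# Optional argument: directory to read instead of CS_DIR.
show_clocksource() {
    dir=${1:-$CS_DIR}
    if [ -r "$dir/current_clocksource" ]; then
        echo "current: $(cat "$dir/current_clocksource")"
        echo "available: $(cat "$dir/available_clocksource")"
    else
        echo "no clocksource sysfs entries under $dir"
    fi
}

show_clocksource
```

With "clocksource=hpet" in effect, the "current" line would be expected to read hpet.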
Comment 13 Len Brown 2010-06-09 22:10:25 UTC
acpi_pad is new with 2.6.32

Did any linux kernel before 2.6.32 successfully boot?
Comment 14 Len Brown 2010-06-09 22:16:51 UTC
If CONFIG_ACPI_PROCESSOR=m is being used, try CONFIG_ACPI_PROCESSOR=y

Please include here the contents of /proc/cpuinfo
Comment 15 Len Brown 2010-06-09 22:17:46 UTC
oh, and please confirm that booting with "notsc" is a sufficient workaround
Comment 16 Tom Gundersen 2010-06-09 22:45:32 UTC
Kernels before 2.6.32: I have been booting without problems since at least 2008, and 2.6.32 was late 2009. I can try 2.6.31 tomorrow to be certain.

I'm using CONFIG_ACPI_PROCESSOR=y.
Comment 17 Tom Gundersen 2010-06-09 22:46:40 UTC
Created attachment 26704 [details]
cpuinfo
Comment 18 Tom Gundersen 2010-06-09 22:51:27 UTC
Yes, "notsc" works as well.
Comment 19 Venkatesh Pallipadi 2010-06-09 23:02:36 UTC
Maybe there are no C-states exported in ACPI and the TSC is only getting disabled by acpi_pad. What does
# grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*

look like after a successful boot with this revert?

I don't seem to find any TSC related errata for this CPU and the failure signature seems very strange too.
Comment 20 Tom Gundersen 2010-06-10 01:03:15 UTC
Very odd. There might be something else wrong with my system:

There is no "cpuidle" directory in "/sys/devices/system/cpu/cpu*". I cannot recall if this has always been this way, or if it has changed recently.

I have attached my config, in case you can point to some option I should try changing. 

(As mentioned earlier, I'm using this same kernel on a ThinkPad X60 and there C-states are shown in powertop).
Comment 21 Tom Gundersen 2010-06-10 01:04:05 UTC
Created attachment 26707 [details]
.config
Comment 22 Venkatesh Pallipadi 2010-06-10 01:16:39 UTC
No C-states is OK on this platform. Looking at your dmesg again, I see mwait substates as 0x20, which means there is only C1 state supported on this platform.

So, basically, the TSC is not going to be disabled on this platform by the C-state driver, as the TSC is "supposed" to run fine with C1 and will only be broken with C2 or deeper. I was wrong in comment #3 when I said the C-state driver should be disabling the TSC anyway.

So, acpi_pad is the only driver disabling the TSC, and that's helping this platform survive. It will be interesting to see pre-2.6.32 behaviour, as Len asked in comment #13.
Comment 23 Tom Gundersen 2010-06-10 01:30:43 UTC
Just checked pre 2.6.32:

Either my recollection is completely wrong, or I have somehow changed my config since that time, because it was as Len predicted.

5e5027bd26ed4df735d29e66cd5c1c9b5959a587 (the merge of acpi_pad) boots fine, but its parent does not (with the same error as after 0dc698b9).

Let me know if there are other things I should try.
Comment 24 Venkatesh Pallipadi 2010-06-10 15:39:01 UTC
Did you happen to change BIOS on this box recently?
Comment 25 Tom Gundersen 2010-06-10 16:00:53 UTC
I have upgraded to the newest BIOS version in the hopes of fixing my very slow boot (it only helped a bit).

This is my current BIOS version:

****
P5K BIOS 1201
1. Support new CPUs. Please refer to our website at: http://support.asus.com/cpusupport/cpusupport.aspx
2. Fix the system sometimes takes about 30 seconds to get into the OS when many USB devices are plugged into the system.
****

I'm afraid I cannot remember what kernel version I was on at the time of the upgrade, so don't know if there was a regression in the BIOS.
Comment 26 Len Brown 2010-06-11 17:24:30 UTC
Thanks for clarifying that there exists no kernel that boots
on this box w/o "notsc" or "clocksource=hpet" unless
it is 2.6.32 or newer with acpi_pad disabling the TSC.

I wonder how it ran in 2008.  Can you try the distro release
disk that you first installed?
Comment 27 Tom Gundersen 2010-06-11 21:32:57 UTC
I'm confused now.

I found an old Arch Linux (which is what I have always been using) 32 bit (I usually use 64 bit) install CD with 2.6.30-ARCH. It booted fine.

I then compiled 2.6.30 manually, and it did not boot.

I should have tried a 64-bit install CD, but I first have to get hold of some blank CDs.

I never understood how the Arch kernels are generated, so I don't know how to combine my own .config with the standard Arch stuff to debug further.
Comment 28 Len Brown 2010-06-11 22:19:07 UTC
If you can get the .config for the Arch kernel,
either via /boot/config*
or via /proc/config*
then you can drop that into a source tree and build.
That may get you to a place where you've got a config in hand
and source tree in hand that works, and then you just need
to figure out how to break it:-)
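
The steps above can be sketched roughly as follows. This is only a sketch under assumptions: /proc/config.gz exists only if the running kernel was built with CONFIG_IKCONFIG_PROC, the /boot/config-$(uname -r) name is a common distro convention that may not apply here, and the helper name pick_config is made up for illustration.

```shell
#!/bin/sh
# pick_config: hypothetical helper (illustration only).
# Prints the first readable file among its arguments; fails if none is readable.
pick_config() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Candidate locations for the running kernel's config (both are assumptions).
src=$(pick_config /proc/config.gz "/boot/config-$(uname -r)") || src=""
echo "config source: ${src:-none found}"

# Then, from the top of the kernel source tree (not run by this sketch):
#   zcat /proc/config.gz > .config   # or copy the /boot file to .config
#   make oldconfig                   # answer prompts for options new to this tree
#   make
```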
Comment 29 Rafael J. Wysocki 2010-06-12 21:39:50 UTC
Handled-By : Len Brown <lenb@kernel.org>
Comment 30 Zhang Rui 2010-06-30 08:23:23 UTC
Tom, any update on this?
Comment 31 Tom Gundersen 2010-06-30 11:07:56 UTC
Hi,

I tried following Len's advice and compiling with a standard Arch config, but my first attempt failed (I probably didn't compile in the right modules) and I have not had time to pursue it further (each compile takes 20 minutes). I'm quite busy with work at the moment, so I doubt I will manage to spend the required time to bisect the differences between the (probably working) Arch config and my (broken) config.

To sum up:
2.6.30 works with Arch config and not with mine.
All later kernels until 2.6.35-rc1 work with my config.
2.6.35-rc1 does not work with my config.
I am not able to verify that Arch's config works with 2.6.35-rc1.

I'm running rc3 at the moment and the problem is still present. I will keep testing the newer rc's and if things change I will let you know.

If you have any patches you want me to test, or specific configs to change, I'd of course be happy to.
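
For anyone picking up the config comparison mentioned above, a first step could be listing the options that differ between the two files. This is a minimal sketch using only standard tools; the helper name config_diff and the file names in the usage comment are made up for illustration.

```shell
#!/bin/sh
# config_diff: hypothetical helper (illustration only).
# Lists CONFIG_ lines that appear in only one of two kernel config files.
# "<" marks lines only in the first file, ">" lines only in the second.
# Options that are "not set" are comment lines and are treated as absent.
config_diff() {
    a=$(mktemp)
    b=$(mktemp)
    grep '^CONFIG_' "$1" | sort > "$a"
    grep '^CONFIG_' "$2" | sort > "$b"
    diff "$a" "$b" | grep '^[<>]' || true
    rm -f "$a" "$b"
}

# Usage (file names are placeholders):
#   config_diff config.arch config.mine
```

The differing options could then be flipped in groups and rebuilt, halving the set each time, until the option that matters is isolated.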
Comment 32 Rafael J. Wysocki 2010-07-09 21:31:38 UTC
Apparently, this is a .config problem, not a regression, so dropping from the list of recent regressions.
Comment 33 Tom Gundersen 2010-09-27 20:35:26 UTC
Hi,

I revisited this bug with 2.6.36-rc5, and it seems to be solved. dmesg says "Switching to clocksource tsc" and my system works fine.

I don't know what caused the bug to be fixed though...

Cheers,

Tom
