Bug 11727

Summary: boot pauses often after processor module is loaded, resumes only after keypress
Product: Platform Specific/Hardware
Reporter: Florian Hubold (doktor5000)
Component: i386
Assignee: platform_i386
Status: CLOSED DOCUMENTED
Severity: normal
CC: alan, hpa, kernel.9.kronenpj, macro, mingo, tglx, yann
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.26-rc6
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
  cpuinfo output
  dmesg output with boot option apic=verbose
  contents of /proc/timer_list
  acpidump

Description Florian Hubold 2008-10-10 07:26:25 UTC
Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27-rc6
Distribution: Mandriva Linux
Hardware Environment: Sony Laptop PCG-GRT815E

Problem Description: After the processor module is loaded, the machine requires
key presses to continue booting (probably the interrupt generated by the keyboard makes things progress).

The following boot options are the only workarounds i have found so far that do not have any side effects or regressions:

noapic lapic acpi=on

or

nosmp

The following boot options do not help:

apic lapic acpi=debug        boot pauses

noapic nolapic acpi=debug    everything seems normal, but acpid can't be started
                             due to "/proc/acpi/event - device or resource busy"

noapic lapic acpi=debug      everything seems normal, but acpid can't be started
                             due to "/proc/acpi/event - device or resource busy"

noapic nolapic acpi=ht       everything seems normal, but after giving my
                             password in KDM, the system doesn't go any further

noapic lapic acpi=ht         everything seems normal, but after giving my
                             password in KDM, the system doesn't go any further

nolapic_timer                boot pauses, does not continue at all

nosmb                        normal boot, pauses at DM start


One remark: an uncommon thing about my system is that it uses a standard-issue socket 478 desktop Pentium 4. Also, i can only estimate the exact kernel version in which this problem first appeared: it has happened since i switched from kernel 2.6.26 to kernel 2.6.27-rc5 or -rc6.something.
Comment 1 Florian Hubold 2008-10-10 07:27:50 UTC
Created attachment 18240 [details]
cpuinfo output
Comment 2 Florian Hubold 2008-10-10 07:28:36 UTC
Created attachment 18241 [details]
dmesg output with boot option apic=verbose
Comment 3 Florian Hubold 2008-10-10 07:29:28 UTC
Created attachment 18242 [details]
contents of /proc/timer_list
Comment 4 Florian Hubold 2008-10-10 07:30:37 UTC
Created attachment 18243 [details]
acpidump
Comment 5 Florian Hubold 2008-10-10 07:31:28 UTC
Ah, before i forget: The corresponding mandriva bug report is this one:
https://qa.mandriva.com/show_bug.cgi?id=44342

Please let me know if you need additional information.
Comment 6 Andrew Morton 2008-10-10 10:28:54 UTC
Ingo, Thomas: could you take a look please?  It's a post-2.6.26 regression.
Comment 7 Thomas Gleixner 2008-10-11 03:34:48 UTC
Are you still on -rc6 or is the problem also in .27 final ?

Thanks,

	tglx
Comment 8 Florian Hubold 2008-10-11 09:41:50 UTC
Well, ATM i'm on rc8.2, which would translate to rc8-git2 for Mandriva if i'm not mistaken. 2.6.27 final is not available yet, but i think it will be at the beginning of next week.
I don't remember exactly which kernel version failed first. I think the first 2.6.27 kernel i tried was 2.6.27-rc5, which had this problem; my guess is that the problem has existed since the first 2.6.27 prerelease.

This is what the mandriva maintainer wrote:

"Ok, my guesses then were right, probably something new in clockevents/timers
related to apic that comes in when Cx idle states are in use (that is what
happens when you load processor module)."



Oh, another important thing i forgot:

blacklisting the thermal and processor modules fixes the problem without changing boot options, maybe this helps you a bit more.
Comment 9 Adam Williamson 2008-10-18 00:36:56 UTC
This is also affecting Ubuntu:

https://bugs.launchpad.net/linux/+bug/272247
Comment 10 Thomas Gleixner 2008-10-18 01:15:48 UTC
Adam, the Ubuntu bug is related to AMD C1E machines, which is a different problem. Please open a separate bug for that.

Maciej, can you please have a look as well ?

The kernel boots fine with "noapic lapic acpi=on" which means that disabling the io_apic is making the problem go away. The boot log says:

 IO-APIC (apicid-pin) 1-0 not connected.
 IO-APIC (apicid-pin) 1-17, 1-18, 1-19, 1-20, 1-21, 1-22, 1-23 not connected.
..TIMER: vector=0x31 apic1=0 pin1=16 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...
..... (found apic 0 pin 16) ...
....... failed.
...trying to set up timer as Virtual Wire IRQ...
..... works.
Comment 11 Adam Williamson 2008-10-23 11:44:16 UTC
Oh, yes, you're right, sorry; I went off the comment in the MDV bug report and didn't notice the difference in hardware.
Comment 12 Florian Hubold 2008-11-08 10:15:57 UTC
Now i have tested with 2.6.27 final, report is still valid.

What may be of interest is the fact that "clocksource=jiffies" alone also works around the problem, without blacklisting the thermal and processor modules or changing the apic/acpi options. Stumbled upon that accidentally.

Do you need any more information to debug/fix this problem?
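
To confirm which clocksource is actually in effect after booting with "clocksource=jiffies", a minimal sketch along the following lines could be used. It assumes the kernel in question exposes the standard sysfs clocksource files; the function name read_clocksources and its base parameter are made up for illustration.

```python
from pathlib import Path

# Standard sysfs location of the clocksource controls on Linux.
# Assumption: the running kernel exposes these files.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names read from sysfs.

    `base` is a hypothetical parameter so the function can be pointed
    at a different directory, for example when testing.
    """
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Calling read_clocksources() on the affected machine and comparing the result with and without the boot option would show whether the workaround changes the clocksource the kernel selects.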
Comment 13 Ingo Molnar 2008-11-08 11:00:49 UTC
> Do you need any more information to debug/fix this problem?

please check the tip/master tree:

  http://people.redhat.com/mingo/tip.git/README

just today i queued up fixes for similar problems. Can you still see 
the same problem with tip/master?

	Ingo
Comment 14 Florian Hubold 2008-11-13 00:18:14 UTC
Still valid with tip/master tree.

Or is there anything in particular that needs to be activated
for the fixes to take effect?
Comment 15 Florian Hubold 2009-03-05 12:21:21 UTC
(In reply to comment #14)
> Still valid with tip/master tree.
> 
> Or is there anything in particular that needs to be activated
> for the fixes to take effect?


Ping? 
Comment 16 Ingo Molnar 2009-03-06 02:54:42 UTC
> > Or is there anything in particular that needs to be 
> > activated for the fixes to take effect?
> 
> Ping?

Does it still occur with latest tip:master? In particular this 
fix:

a682604: rcu: Teach RCU that idle task is not quiscent state at boot

Might fix such long delays.

If it still occurs, could you also post dmesg output from a
kernel that has CONFIG_PRINTK_TIME=y enabled? That will print a
seconds-since-bootup timestamp with each kernel message - that
way we can see precisely where the delay is occurring.

Thanks,

	Ingo
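
A rough way to spot where the delay occurs in such a timestamped log is to look for the largest gaps between consecutive messages. The sketch below assumes the usual "[   12.345678] message" prefix produced with CONFIG_PRINTK_TIME=y; the helper name largest_gaps is hypothetical.

```python
import re

# Matches the "[   12.345678] message" prefix added by CONFIG_PRINTK_TIME.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def largest_gaps(dmesg_text, top=3):
    """Return up to `top` (gap_seconds, previous_msg, next_msg) tuples,
    largest gap first, for consecutive timestamped kernel messages."""
    entries = []
    for line in dmesg_text.splitlines():
        match = TIMESTAMP_RE.match(line)
        if match:  # lines without a timestamp prefix are skipped
            entries.append((float(match.group(1)), match.group(2)))
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(entries, entries[1:])
    ]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

Feeding the saved dmesg text to largest_gaps() lists the message pairs with the most time between them, which is where a pause would show up.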
Comment 17 Florian Hubold 2009-05-23 12:36:38 UTC
(In reply to comment #16)
> Does it still occur with latest tip:master? In particular this 
> fix:
> 
> a682604: rcu: Teach RCU that idle task is not quiscent state at boot
> 
> Might fix such long delays.

With latest tip (2.6.30-rc6) it has actually gotten worse. Directly after starting udev the system freezes hard, i was able to switch to VT1, but nothing else. Do you want a dmesg output from that kernel with PRINTK_TIME enabled and with clocksource=jiffies so that it actually boots?

Is there some other information that might help you fix this?

And please remember: the kernel doesn't just take a long time during boot; it simply pauses until a key is pressed (repeatedly, maybe because of some interrupt problem?). Otherwise the boot process does not continue after starting udev.
Comment 18 Florian Hubold 2009-05-31 14:11:29 UTC
ping?
Comment 19 Florian Hubold 2009-06-22 10:41:22 UTC
uhhm ... ping?
Comment 20 Florian Hubold 2009-07-23 19:58:17 UTC
Checked again with 2.6.31-rc2, still valid.
Boot process hangs when trying to start udev, could only reboot with Alt-SysRq.

Is there anything i can do to help getting this fixed?
Comment 21 Florian Hubold 2009-09-28 19:03:31 UTC
Hello!

I checked this one against 2.6.31 final, and it is still valid.
Pretty please, could we please get this fixed somehow? This has been in the kernel for at least 5 stable releases, and prevents all newer distributions from even booting up.


What information or help can i provide to help get this fixed?
Comment 22 H. Peter Anvin 2009-09-28 20:02:36 UTC
I don't see this having been "git bisect"-ed.  Given that the most fundamental problem here is lack of information about a hardware-specific problem, since it is a regression, "git bisect" should be able to give fine-grained information about what broke it.
Comment 23 Florian Hubold 2009-09-28 21:46:13 UTC
Why wasn't this mentioned earlier? So bisecting this requires me to build and test what, like some dozen kernels between 2.6.26 and 2.6.31? Whoa ...


Would this be the right documentation to start with?
http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
Comment 24 H. Peter Anvin 2009-09-28 21:49:02 UTC
That would be a decent place to start, yes.

Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27-rc6

This is the obvious range to bisect within.
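
As a rough guide to the effort involved: bisection halves the candidate range on every step, so the number of kernels to build and test grows only logarithmically with the number of commits in the range. The sketch below is an approximation, not git's exact algorithm, and the commit count in the usage note is a made-up figure, not the real size of this range.

```python
def bisect_steps(num_commits):
    """Approximate number of build-and-test rounds needed to bisect a
    range of `num_commits` candidate commits, halving on each round.

    Computes ceil(log2(num_commits)) with integer arithmetic.
    """
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (num_commits - 1).bit_length()
```

For a hypothetical range of 10000 commits, bisect_steps(10000) gives 14 rounds.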
Comment 25 Florian Hubold 2009-10-06 14:16:51 UTC
As far as i can tell, the faulty commit happened between 2.6.26 and 2.6.27-rc1. git bisect told me something about linux-acpi that had been merged in between, but i forgot to save that information.

My current problem is that it always compiles the full kernel with every module; i thought it only recompiles the stuff that changed in between? Just building the kernels takes a whole lotta time, or am i doing something wrong?
Comment 26 Maciej W. Rozycki 2009-10-15 16:56:29 UTC
Hi Florian,

 Somehow I missed the APIC information in this entry before, sorry.  I can see
an excerpt from the boot log above:

 IO-APIC (apicid-pin) 1-0 not connected.
 IO-APIC (apicid-pin) 1-17, 1-18, 1-19, 1-20, 1-21, 1-22, 1-23 not connected.
..TIMER: vector=0x31 apic1=0 pin1=16 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...
..... (found apic 0 pin 16) ...
....... failed.
...trying to set up timer as Virtual Wire IRQ...
..... works.

This is unusually weird -- the kernel reports it was told, presumably by ACPI
tables, that the IRQ0 timer interrupt is wired to the input #16 of the I/O
APIC #0.  While technically possible this would be the first time I have seen
such a configuration and given that it does not work, it is almost surely
wrong.

 It would be useful to get a dump of the ACPI tables for your system; I don't
remember off the top of my head how to do this, but perhaps someone else on the
CC list for this bug will be able to tell you.  If this is indeed what the ACPI
table reports, then I suggest you look for a BIOS upgrade or report the bug to
your technical support as appropriate.  For Linux to cover the buggy systems, a
quirk will have to be added.

  Maciej
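
The "..TIMER:" line from the excerpt above can be picked apart mechanically to flag a suspicious timer route. The following is a sketch only: the function names are hypothetical, and the set of "usual" pins is an assumption made for illustration, not something taken from the kernel sources.

```python
import re

# Pattern for the "..TIMER:" line printed with apic=verbose.
TIMER_RE = re.compile(
    r"\.\.TIMER: vector=(0x[0-9a-fA-F]+) apic1=(-?\d+) pin1=(-?\d+) "
    r"apic2=(-?\d+) pin2=(-?\d+)"
)

# Assumption for illustration: I/O APIC inputs on which the IRQ0 timer
# is conventionally expected. Adjust if this does not match your systems.
USUAL_TIMER_PINS = {0, 2}


def parse_timer_line(line):
    """Return the fields of a "..TIMER:" line as a dict, or None."""
    match = TIMER_RE.search(line)
    if match is None:
        return None
    vector, apic1, pin1, apic2, pin2 = match.groups()
    return {
        "vector": int(vector, 16),
        "apic1": int(apic1),
        "pin1": int(pin1),
        "apic2": int(apic2),
        "pin2": int(pin2),
    }


def timer_pin_is_unusual(info):
    """True if the primary timer pin is outside the assumed usual set."""
    return info["pin1"] not in USUAL_TIMER_PINS
```

Applied to the line quoted in this comment, the parsed pin1 is 16, which the check reports as unusual.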
Comment 27 Florian Hubold 2009-10-16 15:00:06 UTC
(In reply to comment #26)
Well, the laptop in question is like 5 years old now, and i don't know how long it had been in production before that. There are no updates for anything, it is considered a "legacy model", and i don't think i can convince Sony into releasing a new BIOS update to support newer Linux kernels. Can you imagine a realistic answer for such a request?

Wouldn't it be better to mark this bug as RESOLVED DOCUMENTED or something like that, given that an easy workaround is available? (boot option clocksource=jiffies)


And the acpi dump was provided as an attachment, isn't that the requested information? Otherwise, could someone please tell me how to dump the needed acpi tables?
Comment 28 Florian Hubold 2009-10-16 15:00:45 UTC
Sorry for marking it as RESOLVED DOCUMENTED, this was just a mistake.
Comment 29 Maciej W. Rozycki 2009-10-16 15:44:38 UTC
Hi Florian,

 I missed the attachments, sorry -- yes, this is enough information,
thank you.

 Given the circumstances I agree chances you'd convince Sony are slim.
I thought this was about a more recent piece of hardware.

 As this is a regression -- could you please obtain a bootstrap log
(with apic=verbose) from the last version of the kernel that worked for
you?  The changes that were made in the suspected area do not explain
the regression -- the mode of failure indicates the timer shouldn't
have worked before either, so clearly there is something wrong with my
understanding of the problem.  Perhaps the reason is not the APIC, but
a coincidental change elsewhere.  I wonder if it's possible that the
Virtual Wire IRQ timer setup does not work anymore at all -- it's
hardly ever used and is unique in that it is the only case where Linux
sets up a local APIC interrupt in the native mode.

  Maciej
Comment 30 Yann Droneaud 2010-11-02 11:29:17 UTC
Have you tried the acpi_skip_timer_override boot option,
as reported here:

https://bugzilla.kernel.org/show_bug.cgi?id=15289

and here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/254668/comments/125
Comment 31 Florian Hubold 2010-11-02 18:22:00 UTC
Nope. I sold the laptop last month, because i needed something a bit spiffier ;) . Anyway, i ran into some people who had a similar or the same problem; if i meet them again and the problem still persists, i'll try this and report back. But don't bet on that, at least not in the near future.

For my part, this can be closed as RESOLVED DOCUMENTED like i mentioned above, because you only offer a workaround, too.


Kind Regards