Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27-rc6
Distribution: Mandriva Linux
Hardware Environment: Sony Laptop PCG-GRT815E
Problem Description:
The machine requires key presses after the processor module is loaded in order to continue booting (probably the interrupt generated by the keyboard makes things progress).

The following boot options are the only workarounds I have found so far that do not have any side effects or regressions:

  noapic lapic acpi=on
or
  nosmp

The following boot options do not help:

  apic lapic acpi=debug
    boot pauses
  noapic nolapic acpi=debug
    everything seems normal, but acpid can't be started due to "/proc/acpi/event - device or resource busy"
  noapic lapic acpi=debug
    everything seems normal, but acpid can't be started due to "/proc/acpi/event - device or resource busy"
  noapic nolapic acpi=ht
    everything seems normal, but after giving my password in KDM, the system doesn't go any further
  noapic lapic acpi=ht
    everything seems normal, but after giving my password in KDM, the system doesn't go any further
  nolapic_timer
    boot pauses, does not continue at all
  nosmb
    normal boot, pauses at DM start

One remark: an uncommon thing about my system is that it uses a standard-issue socket 478 desktop Pentium 4. I can only estimate the exact kernel version in which this problem started: it has been happening since I switched from kernel 2.6.26 to kernel 2.6.27-rc5 or -rc6.something.
Created attachment 18240: cpuinfo output
Created attachment 18241: dmesg output with boot option apic=verbose
Created attachment 18242: contents of /proc/timer_list
Created attachment 18243: acpidump
Ah, before I forget: the corresponding Mandriva bug report is this one: https://qa.mandriva.com/show_bug.cgi?id=44342 Please let me know if you need additional information.
Ingo, Thomas: could you take a look please? It's a post-2.6.26 regression.
Are you still on -rc6 or is the problem also in .27 final ? Thanks, tglx
Well, ATM I'm on rc8.2, which would translate to rc8-git2 for Mandriva if I'm not mistaken. 2.6.27 final is not available yet, but I think it will be at the beginning of next week.

I no longer know exactly which version the first failing kernel was. I think the first 2.6.27 kernel I tried was 2.6.27-rc5, which had this problem; my guess is that the problem has existed since the first 2.6.27 prerelease.

This is what the Mandriva maintainer wrote:

"Ok, my guesses then were right, probably something new in clockevents/timers related to apic that comes in when Cx idle states are in use (that is what happens when you load processor module)."

Oh, another important thing I forgot: blacklisting the thermal and processor modules fixes the problem without changing boot options. Maybe this helps you a bit more.
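When testing the blacklisting workaround it can help to confirm that the two modules really stayed unloaded. The following is only a minimal Python sketch: the helper names `loaded_modules` and `still_loaded` are made up for illustration, and it assumes both drivers are built as modules (built-in drivers never show up in /proc/modules).

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def still_loaded(proc_modules_text, blacklisted=("processor", "thermal")):
    """Return the blacklisted modules that are loaded anyway, sorted by name."""
    return sorted(set(blacklisted) & loaded_modules(proc_modules_text))


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            text = handle.read()
    except OSError as exc:
        print(f"could not read /proc/modules: {exc}")
    else:
        print("still loaded:", still_loaded(text) or "none")
```

An empty result after boot suggests the blacklist took effect; a non-empty one means something else pulled the module in.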
This is also affecting Ubuntu: https://bugs.launchpad.net/linux/+bug/272247
Adam, the Ubuntu bug is related to AMD C1E machines, which is a different problem. Please open a separate bug for that.

Maciej, can you please have a look as well?

The kernel boots fine with "noapic lapic acpi=on", which means that disabling the io_apic makes the problem go away. The boot log says:

  IO-APIC (apicid-pin) 1-0 not connected.
  IO-APIC (apicid-pin) 1-17, 1-18, 1-19, 1-20, 1-21, 1-22, 1-23 not connected.
  ..TIMER: vector=0x31 apic1=0 pin1=16 apic2=-1 pin2=-1
  ..MP-BIOS bug: 8254 timer not connected to IO-APIC
  ...trying to set up timer (IRQ0) through the 8259A ...
  ..... (found apic 0 pin 16) ...
  ....... failed.
  ...trying to set up timer as Virtual Wire IRQ...
  ..... works.
Oh, yes, you're right, sorry; I went off the comment in the MDV bug report and didn't notice the difference in hardware.
I have now tested with 2.6.27 final, and the report is still valid. What may be of interest is the fact that "clocksource=jiffies" alone also works around the problem, without blacklisting the thermal and processor modules or changing the apic/acpi options. I stumbled upon that accidentally. Do you need any more information to debug/fix this problem?
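To verify that a boot with "clocksource=jiffies" really ended up on that clocksource, the kernel's choice can be read back from sysfs. This is a minimal Python sketch, assuming the kernel exposes the usual clocksource files under /sys/devices/system/clocksource/clocksource0; the function name `read_clocksources` is made up for illustration.

```python
from pathlib import Path

# Usual sysfs location of the clocksource controls on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources read from sysfs-style files."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    try:
        current, available = read_clocksources()
    except OSError as exc:
        print(f"could not read clocksource info: {exc}")
    else:
        print(f"current clocksource: {current}")
        print(f"available clocksources: {', '.join(available)}")
```

Comparing the output of a normal boot with a "clocksource=jiffies" boot shows which clocksource the failing configuration selects by default.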
> Do you need any more information to debug/fix this problem?

please check the tip/master tree:

  http://people.redhat.com/mingo/tip.git/README

just today i queued up fixes for similar problems. Can you still see the same problem with tip/master?

  Ingo
Still valid with tip/master tree. Or is there anything in particular that needs to be activated for the fixes to take effect?
(In reply to comment #14)
> Still valid with tip/master tree.
>
> Or is there anything in particular that needs to be activated
> for the fixes to take effect?

Ping?
> > Or is there anything in particular that needs to be
> > activated for the fixes to take effect?
>
> Ping?

Does it still occur with latest tip:master? In particular this fix:

  a682604: rcu: Teach RCU that idle task is not quiscent state at boot

Might fix such long delays.

If it still occurs, could you also post dmesg output from a kernel that has CONFIG_PRINTK_TIME=y enabled? That will print a seconds-since-bootup timestamp with each kernel message - that way we can see precisely where the delay is occurring.

Thanks,

  Ingo
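Once a log with CONFIG_PRINTK_TIME=y is available, the pause can be located by looking for the largest jump between consecutive timestamps. This is a minimal Python sketch, assuming the usual "[seconds.microseconds] message" prefix; the function name `largest_gaps` and the sample lines are invented for illustration and are not taken from the reporter's machine.

```python
import re

# With CONFIG_PRINTK_TIME=y each line starts with "[ seconds.microseconds]".
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def largest_gaps(dmesg_text, top=3):
    """Return the `top` largest gaps as (gap_seconds, previous_msg, next_msg)."""
    stamped = []
    for line in dmesg_text.splitlines():
        match = TIMESTAMP_RE.match(line)
        if match:  # lines without a timestamp are skipped
            stamped.append((float(match.group(1)), match.group(2)))
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]


if __name__ == "__main__":
    # Invented sample lines, only to show the expected input format.
    sample = (
        "[    0.000000] first message\n"
        "[    0.500000] second message\n"
        "[   12.500000] third message\n"
    )
    for gap, before, after in largest_gaps(sample, top=1):
        print(f"{gap:.3f}s between {before!r} and {after!r}")
```

The messages on either side of the largest gap show where the boot was waiting, which in this report would be expected shortly after the processor module loads.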
(In reply to comment #16)
> Does it still occur with latest tip:master? In particular this
> fix:
>
>   a682604: rcu: Teach RCU that idle task is not quiscent state at boot
>
> Might fix such long delays.

With latest tip (2.6.30-rc6) it has actually gotten worse. Directly after starting udev the system freezes hard; I was able to switch to VT1, but nothing else.

Do you want dmesg output from that kernel with PRINTK_TIME enabled and with clocksource=jiffies so that it actually boots? Is there some other information that might help you fix this?

And please remember: the kernel doesn't make long delays during boot, it simply pauses until a key is pressed (repeatedly, maybe because of some interrupt problems?). Otherwise the boot process does not continue after starting udev.
ping?
uhhm ... ping?
Checked again with 2.6.31-rc2, still valid. The boot process hangs when trying to start udev, and I could only reboot with Alt-SysRq. Is there anything I can do to help get this fixed?
Hello! I checked this one against 2.6.31 final, and it is still valid. Pretty please, could we get this fixed somehow? This has been in the kernel for at least 5 stable releases, and prevents all newer distributions from even booting up. What information or help can I provide to get this fixed?
I don't see this having been "git bisect"-ed. Given that the most fundamental problem here is lack of information about a hardware-specific problem, since it is a regression, "git bisect" should be able to give fine-grained information about what broke it.
Why wasn't this mentioned earlier? So bisecting this requires me to build and test what, like some dozen kernels between 2.6.26 and 2.6.31? Whoa ... Would this be the right documentation to start with? http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
That would be a decent place to start, yes.

  Latest working kernel version: 2.6.26
  Earliest failing kernel version: 2.6.27-rc6

This is the obvious range to bisect within.
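On the question of how many kernels need building: bisection halves the suspect range with every test, so the count grows with the logarithm of the number of commits, not with the number itself. This is a minimal Python sketch of that idea, assuming a simple linear history (real git history is a graph, so git's own step count can differ slightly). The function names and the figure of 1000 commits are hypothetical, not the real size of the 2.6.26 to 2.6.27-rc6 range.

```python
import math


def max_bisect_steps(num_commits):
    """Upper bound on the kernels to build for a linear range of commits."""
    return math.ceil(math.log2(num_commits)) if num_commits > 1 else 0


def find_first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits[:k] are all good and commits[k:] are all bad (with at
    least one bad commit), which is what bisection relies on.
    Returns (first_bad_commit, tests_run).
    """
    low, high = 0, len(commits) - 1  # commits[high] is known to be bad
    tests = 0
    while low < high:
        mid = (low + high) // 2
        tests += 1  # each test stands for one kernel build and boot
        if is_bad(commits[mid]):
            high = mid  # first bad commit is mid or earlier
        else:
            low = mid + 1  # first bad commit is after mid
    return commits[low], tests


if __name__ == "__main__":
    commits = list(range(1000))  # hypothetical range size
    first_bad, tests = find_first_bad(commits, lambda c: c >= 637)
    print(f"first bad: {first_bad}, builds needed: {tests}")
    print(f"upper bound: {max_bisect_steps(len(commits))}")
```

With git itself the same search is driven by `git bisect start`, `git bisect bad` and `git bisect good`, marking each tested kernel until git names the first bad commit.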
As far as I can tell, the faulty commit happened between 2.6.26 and 2.6.27-rc1. git bisect told me something about linux-acpi that had been merged in between, but I forgot to save that information. My current problem is that it always compiles the full kernel with every module; I thought it only recompiles the stuff that changed in between? It takes a whole lotta time just building the kernels, or am I doing something wrong?
Hi Florian,

Somehow I missed the APIC information in this entry before, sorry. I can see an excerpt from the boot log above:

  IO-APIC (apicid-pin) 1-0 not connected.
  IO-APIC (apicid-pin) 1-17, 1-18, 1-19, 1-20, 1-21, 1-22, 1-23 not connected.
  ..TIMER: vector=0x31 apic1=0 pin1=16 apic2=-1 pin2=-1
  ..MP-BIOS bug: 8254 timer not connected to IO-APIC
  ...trying to set up timer (IRQ0) through the 8259A ...
  ..... (found apic 0 pin 16) ...
  ....... failed.
  ...trying to set up timer as Virtual Wire IRQ...
  ..... works.

This is unusually weird -- the kernel reports it was told, presumably by ACPI tables, that the IRQ0 timer interrupt is wired to input #16 of I/O APIC #0. While technically possible, this would be the first time I have seen such a configuration, and given that it does not work, it is almost surely wrong.

It would be useful to get a dump of the ACPI tables for your system; I don't remember off the top of my head how to do this, but perhaps someone else on the CC list for this bug will be able to tell you. If this is indeed what the ACPI table reports, then I suggest you look for a BIOS upgrade or report the bug to your technical support as appropriate; for Linux to cover the buggy systems, a quirk will have to be added.

  Maciej
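For comparing boot logs from different kernels, the "..TIMER:" line can be pulled out and turned into numbers. This is a minimal Python sketch that assumes the line has exactly the format quoted above; the function name `parse_timer_line` is made up for illustration.

```python
import re

# Matches the "..TIMER:" message format quoted in the boot log excerpt.
TIMER_RE = re.compile(
    r"\.\.TIMER: vector=(0x[0-9a-fA-F]+) apic1=(-?\d+) pin1=(-?\d+) "
    r"apic2=(-?\d+) pin2=(-?\d+)"
)


def parse_timer_line(line):
    """Parse a '..TIMER:' boot message into a dict of ints, or return None."""
    match = TIMER_RE.search(line)
    if match is None:
        return None
    vector, apic1, pin1, apic2, pin2 = match.groups()
    return {
        "vector": int(vector, 16),  # the vector is printed in hexadecimal
        "apic1": int(apic1),
        "pin1": int(pin1),
        "apic2": int(apic2),
        "pin2": int(pin2),
    }


if __name__ == "__main__":
    line = "..TIMER: vector=0x31 apic1=0 pin1=16 apic2=-1 pin2=-1"
    print(parse_timer_line(line))
```

Running it over the apic=verbose logs of a working and a failing kernel makes it easy to see whether the reported timer pin differs between them.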
(In reply to comment #26)

Well, the laptop in question is about 5 years old now, and I don't know how long it had been in production before that. There are no updates for anything; it is considered a "legacy model", and I don't think I can convince Sony to release a new BIOS update to support newer Linux kernels. Can you imagine a realistic answer to such a request?

Wouldn't it be better to mark this bug as RESOLVED DOCUMENTED or something like that, given that an easy workaround is available (boot option clocksource=jiffies)?

And the ACPI dump was provided as an attachment; isn't that the requested information? If not, could someone please tell me how to dump the needed ACPI tables?
Sorry for marking it as RESOLVED DOCUMENTED, this was just a mistake.
Hi Florian,

I missed the attachments, sorry -- yes, this is enough information, thank you.

Given the circumstances, I agree the chances you'd convince Sony are slim. I thought this was about a more recent piece of hardware.

As this is a regression -- could you please obtain a bootstrap log (with apic=verbose) from the last version of the kernel that worked for you? The changes that were made in the suspected area do not explain the regression -- the mode of failure indicates the timer shouldn't have worked before either, so clearly there is something wrong with my understanding of the problem. Perhaps the reason is not the APIC, but a coincidental change elsewhere.

I wonder if it's possible that the Virtual Wire IRQ timer setup does not work anymore at all -- it's hardly ever used and is unique in that it is the only case where Linux sets up a local APIC interrupt in the native mode.

  Maciej
Have you tried to use option acpi_skip_timer_override as reported here: https://bugzilla.kernel.org/show_bug.cgi?id=15289 and here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/254668/comments/125
Nope. I sold the laptop last month, because I needed something a bit spiffier ;) . Anyway, I ran into some people who have a similar or the same problem; if I meet them again and the problem still persists, I'll try this and report back. But don't bet on that, at least not in the near future. For my part, this can be closed as RESOLVED DOCUMENTED like I mentioned above, because you only offer a workaround, too.

Kind Regards