Bug 11142
| Summary: | Freeze with 2.6.25.11 SMP | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | Thomas Jarosch (thomas.jarosch) |
| Component: | i386 | Assignee: | platform_i386 |
| Status: | CLOSED PATCH_ALREADY_AVAILABLE | | |
| Severity: | normal | CC: | bunk, gtdev |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.25.11 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Attachments: | Dmesg output from 2.6.24.7 to 2.6.26 | | |
Description
Thomas Jarosch
2008-07-22 00:21:37 UTC
I found the HP ProLiant watchdog module and enabled it. During the freeze it outputs the following messages on the serial console:

    [10926.398907] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]
    [10991.898991] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]
    [11057.395084] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]

Though the ACPI button trick to unblank the screen stopped working.

Please attach the output of "dmesg -s 1000000" after booting for both kernels.

Created attachment 16947
Dmesg output from 2.6.24.7 to 2.6.26
Luckily the output of the serial console gets logged,
so here's the dmesg output from 2.6.24.7, 2.6.25.11 and 2.6.26.
I'm currently testing 2.6.26; it has been running for 18 hours with an average load of twelve. I'll recompile now without kernel debugging and start the test again. Though that doesn't solve the mystery of why 2.6.25 locks up.
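For anyone collecting the same evidence: below is a minimal Python sketch that picks soft-lockup reports like the ones quoted in the description out of a saved serial-console log. The regular expression is written only against the three messages shown in this report, so other kernel versions may format the line differently. `find_soft_lockups` is an illustrative helper name, not part of any kernel tooling.

```python
import re

# Matches lines such as:
#   [10926.398907] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]
# Assumption: the format is taken from the messages quoted in this report.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft-lockup message found in log_text."""
    hits = []
    for match in SOFT_LOCKUP.finditer(log_text):
        hits.append({
            "timestamp": float(match.group("ts")),      # seconds since boot
            "cpu": int(match.group("cpu")),             # CPU reported as stuck
            "stuck_seconds": int(match.group("secs")),  # duration from the message
            "comm": match.group("comm"),                # task name, e.g. "sh"
            "pid": int(match.group("pid")),
        })
    return hits
```

Feeding it the three lines quoted in the description yields three entries, all for CPU 1 and the task `sh` (PID 6315), which makes it easy to see that the same task stays stuck across reports.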
2.6.26 seems to be running fine. Dunno if it's worth it to debug this further. Is there anything obviously wrong in the dmesg output for 2.6.25?

Tested 2.6.25.15 for fun yesterday, still freezes.

Hmm, there is nothing obvious in the 2.6.25 log. The watchdog looks like timer-related wreckage, but I doubt that as you seem to have NOHZ and HIGHRES disabled. It could be some locking problem as well. That could be checked with CONFIG_LOCKDEP if you really want to track it down.

I'm closing the bug with insufficient data.

Uh oh, I updated our main firewall box to 2.6.26.4 yesterday. Today at 19h, it completely froze; the keyboard handler was dead. Last thing in the logs was a DHCP request -> network traffic. The hardware is an HP DL320 G3 Celeron, so the SMP kernel is running in UP mode. The same box has been running 2.6.24 SMP for weeks without a lockup. If you don't mind, I really would like to track this down in 2.6.25 as I can easily reproduce it with an HP DL320 G5p box. I already had CONFIG_LOCKDEP enabled (see the first problem description); is there anything particular needed on my side to track this further?

Anything else I can do? Dâniel Fraga recently reported he's seeing stalling TCP connections if he enables HPET, though this has not been verified yet: http://marc.info/?l=linux-netdev&m=122090525823708 HPET is enabled in my kernel builds, too. Any help is really appreciated.

Could you try disabling HPET? Then we can see whether it is involved or not.

Thomas, thanks for your hint and your previous hint about the timer. I did a "make defconfig" and gave it a burn-in test for one day -> No crash. Then I took my configuration again and it crashed after two hours. So I diffed the default config with mine and found four significant changes:

    CONFIG_XEN=y
    CONFIG_HPET_EMULATE_RTC=y
    CONFIG_PM_LEGACY=y
    CONFIG_RTC=y

I added those options to the default config and it crashed, too. Right now I've disabled CONFIG_XEN and CONFIG_PM_LEGACY to see if it still crashes.
If yes, it must be related to the RTC options. "grep RTC .config" shows a warning:

    CONFIG_HPET_EMULATE_RTC=y
    CONFIG_RTC=y
    # CONFIG_HPET_RTC_IRQ is not set
    CONFIG_RTC_LIB=m
    CONFIG_RTC_CLASS=m
    # Conflicting RTC option has been selected, check GEN_RTC and RTC

Do the RTC options cause the crash?

There was a problem in that area, which is fixed in current mainline. CONFIG_RTC conflicts with CONFIG_RTC_LIB. Current mainline now excludes that combination at the Kconfig level. There were reports about lockups with both options set, but I don't remember the details off the top of my head.

Hmm, I found something related to RTC on lkml: http://marc.info/?l=linux-kernel&m=122072599800801 Smells like the same issue. I patched my kernel with "kdb" and can't even enter it via the keyboard once the box has crashed. We have a cron job that calls hwclock every 4 hours. I just tried the mentioned test case "while :; do hwclock; done" and it crashed after seconds with the same error message from the watchdog -> I'll try the fix now.

The box has been running the stress test fine for 22h, guess the bug is fixed. I've added an additional "while :; do hwclock; done" job to the existing compile and download tests. I'll close the bug as the fix will hopefully be in the next -stable release.

*** Bug 11422 has been marked as a duplicate of this bug. ***
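The config comparison described in this thread (diff the "make defconfig" result against the failing .config, then look at the RTC options) can be sketched in Python as follows. This is a rough illustration, not the method actually used in the report: the helper names `parse_kconfig`, `config_diff` and `has_rtc_conflict` are made up for this example, and the conflict rule encodes only the statement in this thread that CONFIG_RTC conflicts with CONFIG_RTC_LIB.

```python
def parse_kconfig(text):
    """Map option name -> value for every 'CONFIG_FOO=value' line.

    Comment lines, including '# CONFIG_FOO is not set', are skipped,
    so unset options are simply absent from the result.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.startswith("CONFIG_"):
            options[name] = value
    return options


def config_diff(baseline_text, candidate_text):
    """Return the options whose value differs between two .config texts.

    The result maps option -> (baseline value, candidate value), with
    None standing for "not set" on that side.
    """
    base = parse_kconfig(baseline_text)
    cand = parse_kconfig(candidate_text)
    diff = {}
    for name in sorted(set(base) | set(cand)):
        if base.get(name) != cand.get(name):
            diff[name] = (base.get(name), cand.get(name))
    return diff


def has_rtc_conflict(config_text):
    """True if CONFIG_RTC and CONFIG_RTC_LIB are both enabled (y or m).

    Assumption: this mirrors only the conflict named in this bug report.
    """
    opts = parse_kconfig(config_text)
    enabled = ("y", "m")
    return (opts.get("CONFIG_RTC") in enabled
            and opts.get("CONFIG_RTC_LIB") in enabled)


if __name__ == "__main__":
    # Options quoted from the "grep RTC .config" output in this report.
    failing = """\
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
"""
    print("RTC conflict:", has_rtc_conflict(failing))
```

Run against the options quoted from the failing config, the check reports a conflict because CONFIG_RTC=y and CONFIG_RTC_LIB=m are both set. `config_diff` is meant for the earlier step of narrowing a full .config down to the handful of options that differ from the default.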