Last Friday I upgraded from kernel 22.214.171.124 SMP to
kernel 126.96.36.199 SMP on a busy compile box.
The machine froze one day later. I've now downgraded
again as the box has been running 2.6.24(.7) fine for weeks.
I've set up another box (HP DL320 G5) using this kernel. If I run a loop of compiling the kernel with "make -j3" and also generate network traffic by downloading an ISO image from a box next to it, this system also freezes after 20 to 120 minutes.
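The reproduction described above could be scripted roughly like this. It is only a sketch: `stress_loop` is a hypothetical helper, and the ISO URL in the usage comments is a placeholder, not the address used in the actual test.

```shell
# stress_loop LABEL CMD...: rerun CMD until it fails, logging each pass.
# Hypothetical helper for illustration; not part of the original report.
stress_loop() {
    label=$1; shift
    pass=0
    while "$@"; do
        pass=$((pass + 1))
        echo "$label: pass $pass ok"
    done
    echo "$label: failed after $pass passes"
}

# Usage (run from the kernel source tree; ISO_URL is an assumed placeholder):
#   stress_loop build sh -c 'make clean >/dev/null && make -j3 >/dev/null' &
#   stress_loop fetch wget -q -O /dev/null "$ISO_URL" &
#   wait
```

Logging each pass to the serial console gives a rough idea of how far the box got before it froze.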
I recompiled the kernel with Magic SysRq key support, but the serial console and the keyboard are dead. Funny thing though: if I push the power button for just a second, the screen unblanks. So something is still alive...
I've configured these options to aid debugging the freeze:
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_SAMPLES is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_4KSTACKS is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
Unfortunately they didn't output anything.
Here is a grep for "SMP" in the .config file:
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VSMP is not set
And the same one for "mem":
# CONFIG_TINY_SHMEM is not set
# CONFIG_NOHIGHMEM is not set
# CONFIG_HIGHMEM64G is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
# CONFIG_SPARSEMEM_VMEMMAP_ENABLE is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_DEVKMEM is not set
# CONFIG_MEMSTICK is not set
# CONFIG_DEBUG_HIGHMEM is not set
Here's some info about the hardware:
- HP DL320 G5
- Dual core Intel(R) Xeon(R) CPU 3075 @ 2.66GHz
- 2.5 GB RAM
- SATA HDD via ata_piix
Hope this helps a bit.
The kernel is from vanilla sources and compiled for 32 bit
using gcc 4.1.1. For testing purposes I've switched the NIC
with the traffic from "bnx2" to "tg3", though the original
freezing box is using e1000 anyway.
Is there anything else I can do to debug this one?
Thanks in advance,
I found the HP Proliant watchdog module and enabled it.
During the freeze it outputs the following message on the serial console:
[10926.398907] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]
[10991.898991] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]
[11057.395084] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]
Though the ACPI button trick to unblank the screen stopped working.
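To see at a glance which CPU and task the watchdog keeps reporting, the logged serial console output can be summarized. This is a minimal sketch; `lockup_summary` is a made-up name, and the pattern assumes the message format shown above.

```shell
# lockup_summary: read a console log on stdin and print
# "CPU task:pid count" for every soft-lockup report found.
# Assumes the "BUG: soft lockup - CPU#N stuck for Ns! [task:pid]" format.
lockup_summary() {
    sed -n 's/.*BUG: soft lockup - \(CPU#[0-9]*\) stuck for [0-9]*s! \[\(.*\)\]$/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Usage (console.log is a placeholder file name):
#   lockup_summary < console.log
```

If the same CPU and PID show up every time, as they do in the three lines above, the CPU is stuck in one place rather than hitting sporadic stalls.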
Please attach the output of "dmesg -s 1000000" after booting for both kernels.
Created attachment 16947
Dmesg output from 188.8.131.52 to 2.6.26
Luckily the output of the serial console gets logged,
so here's the dmesg output from 184.108.40.206, 220.127.116.11 and 2.6.26.
I'm currently testing 2.6.26; it has been running for 18 hours with an average load of twelve. I'll recompile now without kernel debugging and start the test again. That doesn't solve the mystery of why 2.6.25 locks up, though.
2.6.26 seems to be running fine. Dunno if it's worth it to debug this further.
Is there anything obviously wrong in the dmesg output for 2.6.25?
Tested 18.104.22.168 for fun yesterday, still freezes.
Hmm, there is no obvious thing in the 2.6.25 log. The watchdog looks like timer related wreckage, but I doubt that as you seem to have NOHZ and HIGHRES disabled. It could be some locking problem as well. That could be checked with CONFIG_LOCKDEP if you really want to track it down.
I'm closing the bug with insufficient data.
Uhh ohhh, I've updated our main firewall box to 22.214.171.124 yesterday.
Today at 19h it froze completely; the keyboard handler was dead.
Last thing in the logs was a DHCP request -> network traffic.
The hardware is a HP DL320 G3 Celeron, so the SMP kernel is running in UP mode.
The same box has been running 2.6.24 SMP for weeks without a lockup.
If you don't mind I really would like to track this down in 2.6.25 as I can easily reproduce it with a HP DL320 G5p box. I already had CONFIG_LOCKDEP enabled (see the first problem description), is there anything particular needed on my side to further track this? Anything else I can do?
Dâniel Fraga recently reported he's seeing stalling TCP connections
if he enables HPET, though this has not been verified yet:
HPET is enabled in my kernel builds, too.
Any help is really appreciated.
Could you try to disable HPET? Then we can see whether it is involved or not.
Thomas, thanks for your hint and your previous hint about the timer.
I did a "make defconfig" and gave it a burn-in test for one day -> No crash.
Then I took my configuration again and it crashed after two hours. So I diffed the default config with mine and found four significant changes:
I added those options to the default config and it crashed, too.
Right now I've disabled CONFIG_XEN and CONFIG_PM_LEGACY to see if it still crashes. If yes, it must be related to the RTC options.
"grep RTC .config" shows a warning:
# CONFIG_HPET_RTC_IRQ is not set
# Conflicting RTC option has been selected, check GEN_RTC and RTC
Do the RTC options cause the crash?
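For anyone repeating the comparison between the default config and a custom one, something along these lines lists the options that are enabled in the custom file but not set the same way in the default. It is a rough sketch; `config_diff` and the file names in the usage comment are made up for illustration.

```shell
# config_diff DEFAULT MINE: print every CONFIG_*=... line of MINE
# that does not appear verbatim in DEFAULT.
config_diff() {
    grep -E '^CONFIG_[A-Za-z0-9_]+=' "$2" | grep -Fxv -f "$1"
}

# Usage (both file names are placeholders):
#   config_diff config.defconfig config.mine
```

The output only covers options that are set in the second file; options that the custom config turned off would need the same call with the arguments swapped.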
There was a problem in that area, which is fixed in current mainline.
CONFIG_RTC conflicts with CONFIG_RTC_LIB. Current mainline now excludes that combination at the Kconfig level. There were reports about lockups with both options set, but I don't remember the details off the top of my head.
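A quick way to check whether a given .config has both options enabled might look like this. It is only a sketch: `rtc_conflict` is a hypothetical helper, and it tests nothing beyond the two option names mentioned above.

```shell
# rtc_conflict CONFIGFILE: exit 0 if both CONFIG_RTC and CONFIG_RTC_LIB
# are enabled (=y or =m) in the given file, non-zero otherwise.
rtc_conflict() {
    grep -Eq '^CONFIG_RTC=[ym]$' "$1" &&
        grep -Eq '^CONFIG_RTC_LIB=[ym]$' "$1"
}

# Usage:
#   rtc_conflict .config && echo "both CONFIG_RTC and CONFIG_RTC_LIB are set"
```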
Hmm, I found something related to RTC on lkml:
Smells like the same issue. I patched my kernel with "kdb" and can't even enter it via the keyboard once the box crashed.
We have a cron job that calls hwclock every 4 hours. I just tried the mentioned
test-case "while :; do hwclock; done" and it crashed within seconds with the same error message from the watchdog -> I'll try the fix now.
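A bounded variant of that test case is handy when the loop should end on its own once the fix is applied. This is a sketch only; `hammer` is a made-up helper name and the iteration count in the usage comment is arbitrary.

```shell
# hammer N CMD...: run CMD up to N times, stop at the first failure,
# print the number of successful runs, and exit 0 only if all N passed.
hammer() {
    n=$1; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 || break
        i=$((i + 1))
    done
    echo "$i"
    [ "$i" -eq "$n" ]
}

# Usage (needs the privileges hwclock requires):
#   hammer 10000 hwclock
```

On an affected kernel the loop would not get to print anything, since the box locks up; the count is only useful as a sign that a patched kernel survived the run.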
The box has been running the stress test fine for 22h, so I guess the bug is fixed.
I've added an additional "while :; do hwclock; done" job to the existing compile and download tests.
I'll close the bug as the fix will hopefully be in the next -stable release.
*** Bug 11422 has been marked as a duplicate of this bug. ***