Bug 11142

Summary:	Freeze with 2.6.25.11 SMP
Product:	Platform Specific/Hardware	Reporter:	Thomas Jarosch (thomas.jarosch)
Component:	i386	Assignee:	platform_i386
Status:	CLOSED PATCH_ALREADY_AVAILABLE
Severity:	normal	CC:	bunk, gtdev
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	2.6.25.11	Subsystem:
Regression:	Yes	Bisected commit-id:
Attachments:	Dmesg output from 2.6.24.7 to 2.6.26

Description Thomas Jarosch 2008-07-22 00:21:37 UTC

Hello all,

Last Friday I upgraded from kernel 2.6.24.7 SMP to
kernel 2.6.25.11 SMP on a busy compile box.
The machine froze one day later. I've now downgraded
again, as the box had been running 2.6.24(.7) fine for weeks.

I've set up another box (HP DL320 G5) using this kernel. If I run a loop of compiling the kernel with "make -j3" and also generate network traffic by downloading an ISO image from a box next to it, this system also freezes after 20 to 120 minutes.
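A minimal sketch of such a load, assuming a POSIX shell with `make` and `wget` available. `KERNEL_SRC`, `ISO_URL`, `RUN_STRESS` and the `repeat_job` helper are placeholders introduced for illustration; the report does not say how the loops were actually scripted.

```shell
# Placeholder locations (assumptions, not values from the report).
KERNEL_SRC=${KERNEL_SRC:-/usr/src/linux}
ISO_URL=${ISO_URL:-http://192.0.2.1/test.iso}

# repeat_job NAME COUNT CMD...: run CMD repeatedly and print the iteration
# number, so the last line on the console shows how far the job got.
# COUNT=0 means "loop until the command fails".
repeat_job() {
    name=$1
    count=$2
    shift 2
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        echo "$name: iteration $i"
        "$@" || { echo "$name: failed in iteration $i"; return 1; }
    done
}

# One full kernel build with three parallel jobs, as in the report.
compile_once() {
    make -C "$KERNEL_SRC" clean >/dev/null && make -C "$KERNEL_SRC" -j3 >/dev/null
}

# One download of the ISO image, discarding the data.
download_once() {
    wget -q -O /dev/null "$ISO_URL"
}

# Start both loops in parallel only when explicitly requested.
if [ "${RUN_STRESS:-0}" = 1 ]; then
    repeat_job compile 0 compile_once &
    repeat_job download 0 download_once &
    wait
fi
```

Running the script with `RUN_STRESS=1` starts both loops; the printed iteration counters give a rough idea of how long the box survived before the freeze.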

I recompiled the kernel with Magic SysRq key support, but the serial console and the keyboard are dead. Funny thing though: if I push the power button for just a second, the screen unblanks. So something is still alive...

I've configured these options to aid debugging the freeze:
CONFIG_PRINTK_TIME=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_SLAB is not set
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_PI_LIST=y
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_TRACE_IRQFLAGS=y
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_SAMPLES is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_4KSTACKS is not set
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set

Unfortunately they didn't output anything.


Here is a grep for "SMP" in the .config file:
CONFIG_X86_SMP=y
CONFIG_X86_32_SMP=y
CONFIG_SMP=y
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VSMP is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_SCSI_SAS_HOST_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y

And the same one for "mem":
CONFIG_SHMEM=y
# CONFIG_TINY_SHMEM is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
# CONFIG_SPARSEMEM_VMEMMAP_ENABLE is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
# CONFIG_BLK_DEV_UMEM is not set
CONFIG_INPUT_FF_MEMLESS=y
# CONFIG_DEVKMEM is not set
CONFIG_FIX_EARLYCON_MEM=y
# CONFIG_MEMSTICK is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_HAS_IOMEM=y

Here's some info about the hardware:
- HP DL320 G5
- Dual core Intel(R) Xeon(R) CPU 3075  @ 2.66GHz
- 2.5 GB RAM
- SATA HDD via ata_piix

Hope this helps a bit.

The kernel is from vanilla sources and compiled for 32 bit
using gcc 4.1.1. For testing purposes I've switched the NIC
with the traffic from "bnx2" to "tg3", though the original
freezing box is using e1000 anyway.

Is there anything else I can do to debug this one?

Thanks in advance,
Thomas

Comment 1 Thomas Jarosch 2008-07-22 04:30:32 UTC

I found the HP Proliant watchdog module and enabled it.
During the freeze it outputs the following message on the serial console:

[10926.398907] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]
[10991.898991] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]
[11057.395084] BUG: soft lockup - CPU#1 stuck for 61s! [sh:6315]

Though the ACPI button trick to unblank the screen stopped working.
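
When the serial console is logged to a file, repeated watchdog lines like the ones quoted above can be condensed into one line per CPU and task. This is only a sketch: `summarize_lockups` is a hypothetical helper, and it assumes the exact message format shown in this comment.

```shell
# Count "BUG: soft lockup" messages per CPU/task in one or more log files
# (or on stdin when no file is given).
summarize_lockups() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9]*\) stuck for \([0-9]*\)s! \[\(.*\):\([0-9]*\)].*/cpu=\1 task=\3 pid=\4 stuck=\2s/p' "$@" |
        sort | uniq -c
}
```

Called as `summarize_lockups console.log` (file name assumed), it prints a count followed by the CPU, task name, PID and reported stall time, which makes it easier to see whether the same task is stuck every time.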

Comment 2 Adrian Bunk 2008-07-22 11:55:09 UTC

Please attach the output of "dmesg -s 1000000" after booting for both kernels.

Comment 3 Thomas Jarosch 2008-07-23 01:18:09 UTC

Created attachment 16947 [details]
Dmesg output from 2.6.24.7 to 2.6.26

Luckily the output of the serial console gets logged,
so here's the dmesg output from 2.6.24.7, 2.6.25.11 and 2.6.26.

I'm currently testing 2.6.26; it has been running for 18 hours with an average load of twelve. I'll recompile now without kernel debugging and start the test again. Though that doesn't solve the mystery of why 2.6.25 locks up.

Comment 4 Thomas Jarosch 2008-07-30 04:59:50 UTC

2.6.26 seems to be running fine. Dunno if it's worth it to debug this further.

Is there anything obviously wrong in the dmesg output for 2.6.25?

Comment 5 Thomas Jarosch 2008-08-19 01:15:23 UTC

Tested 2.6.25.15 for fun yesterday, still freezes.

Comment 6 Thomas Gleixner 2008-09-04 12:49:55 UTC

Hmm, there is no obvious thing in the 2.6.25 log. The watchdog looks like timer-related wreckage, but I doubt that as you seem to have NOHZ and HIGHRES disabled. It could be some locking problem as well. That could be checked with CONFIG_LOCKDEP if you really want to track it down.

I'm closing the bug with insufficient data.

Comment 7 Thomas Jarosch 2008-09-09 13:03:23 UTC

Uhh ohhh, I updated our main firewall box to 2.6.26.4 yesterday.
Today at 19h, it completely froze; the keyboard handler was dead.
Last thing in the logs was a DHCP request -> network traffic.

The hardware is a HP DL320 G3 Celeron, so the SMP kernel is running in UP mode.
The same box has been running 2.6.24 SMP for weeks without a lockup.

If you don't mind, I would really like to track this down in 2.6.25, as I can easily reproduce it with an HP DL320 G5p box. I already had CONFIG_LOCKDEP enabled (see the first problem description). Is there anything in particular needed on my side to track this further? Anything else I can do?

Dâniel Fraga recently reported he's seeing stalling TCP connections
if he enables HPET, though this has not been verified yet:
http://marc.info/?l=linux-netdev&m=122090525823708
HPET is enabled in my kernel builds, too.

Any help is really appreciated.

Comment 8 Thomas Gleixner 2008-09-11 03:44:03 UTC

Could you try to disable HPET? Then we can see whether it is involved or not.

Comment 9 Thomas Jarosch 2008-09-11 04:05:02 UTC

Thomas, thanks for your hint and your previous hint about the timer.

I did a "make defconfig" and gave it a burn-in test for one day -> No crash.
Then I took my configuration again and it crashed after two hours. So I diffed the default config with mine and found four significant changes:

CONFIG_XEN=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_PM_LEGACY=y
CONFIG_RTC=y

I added those options to the default config and it crashed, too.
Right now I've disabled CONFIG_XEN and CONFIG_PM_LEGACY to see if it still crashes. If yes, it must be related to the RTC options.
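
A comparison like the one described above can be sketched in a few lines of shell. `config_diff` is a hypothetical helper, not something used in this report; it only looks at enabled options (lines starting with `CONFIG_`) and ignores the `# ... is not set` comments.

```shell
# config_diff BASE MINE: print options that are enabled in MINE but are
# missing or have a different value in BASE.
config_diff() {
    a=$(mktemp)
    b=$(mktemp)
    grep '^CONFIG_' "$1" | LC_ALL=C sort > "$a"
    grep '^CONFIG_' "$2" | LC_ALL=C sort > "$b"
    LC_ALL=C comm -13 "$a" "$b"   # lines that appear only in the second file
    rm -f "$a" "$b"
}
```

For example, `config_diff defconfig.saved .config` (file names assumed) lists the candidates that can then be added to the default config one group at a time, as done here.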

"grep RTC .config" shows a warning:
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
# Conflicting RTC option has been selected, check GEN_RTC and RTC

Do the RTC options cause the crash?

Comment 10 Thomas Gleixner 2008-09-11 05:31:15 UTC

There was a problem in that area, which is fixed in current mainline.

CONFIG_RTC conflicts with CONFIG_RTC_LIB. Current mainline now excludes that combination at the Kconfig level. There were reports about lockups with both options set, but I don't remember the details off the top of my head.
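
For kernels that predate the Kconfig exclusion, a .config can be checked for this combination by hand. The following is a sketch only; `rtc_conflict` is a hypothetical helper and treats both `y` and `m` as enabled.

```shell
# rtc_conflict [CONFIG_FILE]: return 1 and print a message if both the
# legacy RTC driver (CONFIG_RTC) and the RTC library (CONFIG_RTC_LIB)
# are enabled in the given config (default: ./.config).
rtc_conflict() {
    config=${1:-.config}
    if grep -Eq '^CONFIG_RTC=[ym]' "$config" &&
       grep -Eq '^CONFIG_RTC_LIB=[ym]' "$config"; then
        echo "conflict: CONFIG_RTC and CONFIG_RTC_LIB are both enabled"
        return 1
    fi
    echo "ok"
}
```

With the options quoted in comment 9 (CONFIG_RTC=y, CONFIG_RTC_LIB=m) this reports a conflict.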

Comment 11 Thomas Jarosch 2008-09-11 06:05:02 UTC

Hmm, I found something related to RTC on lkml:
http://marc.info/?l=linux-kernel&m=122072599800801

Smells like the same issue. I patched my kernel with "kdb" and can't even enter it via the keyboard once the box crashed.

We have a cron job that calls hwclock every 4 hours. I just tried the mentioned
test case "while :; do hwclock; done" and it crashed after seconds with the same error message from the watchdog -> I'll try the fix now.

Comment 12 Thomas Jarosch 2008-09-12 05:16:46 UTC

The box has been running the stress test fine for 22h, so I guess the bug is fixed.
I've added an additional "while :; do hwclock; done" job to the existing compile and download tests.

I'll close the bug as the fix will hopefully be in the next -stable release.

Comment 13 Rafael J. Wysocki 2008-09-14 16:31:21 UTC

*** Bug 11422 has been marked as a duplicate of this bug. ***