Latest working kernel version: Unknow Earliest failing kernel version: 2.6.25-rc2, 2.6.25-rc3 Distribution: RHEL 5.1 Hardware Environment: Intel Xeon 3.40GHz, 4 cores per cpu Software Environment: RHEL 5.1 Problem Description: When i run cpu hotplug stress test on a system with 16 logical cpus, some processes are deadlocked, it can be regenerated on 2.6.25-rc2 and 2.6.25-rc3. The kernel thread [watchdog/1] isn't reaped by its parent when cpu #1 is set to offline, so that [kstopmachine] sleeps for ever. Steps to reproduce: Here is my stress test script, you may try it on SMP platforms (the more number of cpu is , the easier it can be regenerated.). ######################################################################## ####### stresscpus.sh ####### #!/bin/sh for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 do ./bugexposure.sh $i & done ######## bugexposure.sh ########### #!/bin/bash main(){ typeset -i CPU=$1 ./task.sh > /dev/null& PID=$! if [ `cat /sys/devices/system/cpu/cpu${CPU}/online` = "0" ]; then echo "1" > /sys/devices/system/cpu/cpu${CPU}/online fi MASK=$((1<<${CPU})) `taskset -p ${MASK} ${PID} > /dev/null 2>&1` echo "0" > /sys/devices/system/cpu/cpu${CPU}/online echo "1" > /sys/devices/system/cpu/cpu${CPU}/online disown $PID kill -9 $PID > /dev/null 2>&1 #echo "PASS\n" } typeset -i TEST_CPU=$1 while true do main $TEST_CPU done ###### task.sh ###### #!/bin/bash while : do NOOP=1 done ########################################################################
After i investigated, debugged and analyzed it, i think it is a scheduler specific issue, i noticed someone submitted a patch which mentioned kthread_should_stop calling rule: "In order to safely use kthread_stop() for kthread, there is a requirement on how its main loop has to be orginized. Namely, the sequence of events that lead to kthread being blocked (schedule()) has to be ordered as follows: set_current_state(TASK_INTERRUPTIBLE); if (kthread_should_stop()) break; schedule() or similar. set_current_state() implies a full memory barrier. kthread_stop() has a matching barrier right after an update of kthread_stop_info.k and before kthread's wakeup." I don't think it is a good programming paradigm because it will result in some issues very easily, why can't we provide a simple wrapper for it? This issue seems such one, but i tried to change it to follow this rule but the issue is still there. Why isn't the kernel thread [watchdog/1] reaped by its parent? its state is TASK_RUNNING with high priority (R< means this), why it isn't done? Anyone ever met such a problem? Your thought?
My tracing and analysis info is below: # ps aux | grep watchdog root 5 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/0] root 8 0.0 0.0 0 0 ? R< 22:26 0:00 [watchdog/1] root 11 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/2] root 14 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/3] root 17 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/4] root 20 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/5] root 23 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/6] root 26 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/7] root 29 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/8] root 32 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/9] root 35 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/10] root 38 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/11] root 41 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/12] root 44 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/13] root 50 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/15] root 5641 0.0 0.0 61144 708 pts/0 R+ 22:58 0:00 grep watchdog # # ps aux | grep ksoftirqd root 4 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/0] root 10 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/2] root 13 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/3] root 16 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/4] root 19 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/5] root 22 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/6] root 25 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/7] root 28 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/8] root 31 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/9] root 34 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/10] root 37 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/11] root 40 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/12] root 43 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/13] root 49 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/15] root 5647 0.0 0.0 61144 712 pts/0 R+ 23:00 0:00 grep ksoftirqd # #ps aux ... root 5554 0.0 0.0 0 0 ? S< 22:42 0:00 [kstopmachine] root 5556 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5557 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5558 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5559 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5560 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5561 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5562 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5563 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5564 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5565 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5566 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5567 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5568 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> root 5569 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct> ... [watchdog/1] isn't terminated so that [kstopmachine] will waiting for it for ever, so that "echo 0 > /sys/devices/syste/cpu/cpu1/online" will hang up, so that hotplug spinlock will be holden by it, so that other spinlock contenter will wait for locking it for ever, so that... These are some softlock detection output: process 5471 (task.sh) no longer affine to cpu1 INFO: task group_balance:51 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. group_balance D ffff810001011580 0 51 2 ffff81027e97fdc0 0000000000000046 ffff81027e97fdd0 ffffffff80472b15 0000000000000000 ffff81027e97d280 ffff81027c8bd300 ffff81027e97d5d0 0000000080569da0 ffff81027e97d5d0 000000007e97d280 000000010009d366 Call Trace: [<ffffffff80472b15>] thread_return+0x3d/0xa2 [<ffffffff8023cce5>] lock_timer_base+0x26/0x4b [<ffffffff8047338b>] __mutex_lock_slowpath+0x5c/0x93 [<ffffffff80473221>] mutex_lock+0x1a/0x1e [<ffffffff8024f8fd>] get_online_cpus+0x27/0x3e [<ffffffff8022fd04>] load_balance_monitor+0x60/0x375 [<ffffffff8022fca4>] load_balance_monitor+0x0/0x375 [<ffffffff80245ff6>] kthread+0x47/0x75 [<ffffffff802301f0>] schedule_tail+0x28/0x5c [<ffffffff8020cc78>] child_rip+0xa/0x12 [<ffffffff80245faf>] kthread+0x0/0x75 [<ffffffff8020cc6e>] child_rip+0x0/0x12 INFO: task bugexposure.sh:5462 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. bugexposure.s D ffff8100010e1580 0 5462 1 ffff81026a981c58 0000000000000082 ffff81026a981c68 ffffffff80472b15 00000000000004d6 ffff81027c83c920 ffff81027e8bd680 ffff81027c83cc70 0000000d7e8bc4c0 ffff81027c83cc70 0000000d00000002 000000010009d809 Call Trace: [<ffffffff80472b15>] thread_return+0x3d/0xa2 [<ffffffff80314b25>] __next_cpu+0x19/0x28 [<ffffffff8022a5d4>] enqueue_rt_entity+0x4e/0xb1 [<ffffffff80472f34>] schedule_timeout+0x1e/0xad [<ffffffff802280bf>] enqueue_task+0x4d/0x58 [<ffffffff80472ccf>] wait_for_common+0xd5/0x118 [<ffffffff8022c31b>] default_wake_function+0x0/0xe [<ffffffff80245c93>] kthread_stop+0x58/0x79 [<ffffffff8046fdcf>] cpu_callback+0x189/0x197 [<ffffffff8046f59e>] cpu_callback+0x1cf/0x274 [<ffffffff8026c07e>] writeback_set_ratelimit+0x19/0x68 [<ffffffff8047658b>] notifier_call_chain+0x29/0x4c [<ffffffff8024f74d>] _cpu_down+0x1d6/0x2aa [<ffffffff8024f845>] cpu_down+0x24/0x31 [<ffffffff8038fcf4>] store_online+0x29/0x67 [<ffffffff802d23e1>] sysfs_write_file+0xd2/0x110 [<ffffffff8028db49>] vfs_write+0xad/0x136 [<ffffffff8028e086>] sys_write+0x45/0x6e [<ffffffff8020bfd9>] tracesys+0xdc/0xe1
*** Bug 10169 has been marked as a duplicate of this bug. ***
This entry is being used for tracking a regression from 2.6.24. Please don't close it until the problem is fixed in the mainline. Handled-By : Gautham R Shenoy <ego@in.ibm.com>
All the disscussion: http://groups.google.com/group/linux.kernel/browse_thread/thread/1fb3b2eb0b03a329/f79dfe4eb7a6c87a and http://lkml.org/lkml/2008/3/3/11
At the beginning, i thought Detect Soft Lockups (CONFIG_DETECT_SOFTLOCKUP) is culprit because it added much code to detect lock dependencies and deadlocks, but the issue is still there after i disabled soft lock detection, it can be regenerated but will take much long time. At this time, group_balance and kblockd are in "R<" state. [root@harwich-rh ~]# ps aux | grep kstop root 15466 0.0 0.0 0 0 ? S< 01:03 0:00 [kstopmachine] root 15467 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15468 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15469 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15470 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15471 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15472 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15473 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15474 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15475 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 15476 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct> root 17155 0.0 0.0 61148 708 pts/3 R+ 14:47 0:00 grep kstop [root@harwich-rh ~]# ps aux | grep "R<" root 35 0.0 0.0 0 0 ? R< Mar13 0:00 [group_balance] root 182 0.1 0.0 0 0 ? R< Mar13 1:09 [kblockd/0] root 17157 0.0 0.0 61148 704 pts/3 R+ 14:47 0:00 grep R< [root@harwich-rh ~]#
I also tried to specify higher priority for cpu_callback in kernel/softlockup.c so that it can be called firstly when cpu is set to offline, but the problem is still there, i really don't have any way to debug it. :-)
Regressions list annotation: References : http://lkml.org/lkml/2008/3/3/11
Regressions list annotation: Handled-By : Gregory Haskins <ghaskins@novell.com> Patch : http://lkml.org/lkml/2008/3/8/4
fixed by commit 393d94d98b19089ec172566e23557997931b137e