Bug 10157

Summary: Some processes are dealocked when one cpu is set to offline
Product: Process Management Reporter: Yi Yang (yi.y.yang)
Component: SchedulerAssignee: Ingo Molnar (mingo)
Status: CLOSED CODE_FIX    
Severity: blocking CC: bunk, ego, rjw, yi.y.yang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.25-rc2 and rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 9832    

Description Yi Yang 2008-03-02 22:31:32 UTC
Latest working kernel version: Unknow
Earliest failing kernel version: 2.6.25-rc2, 2.6.25-rc3
Distribution: RHEL 5.1
Hardware Environment: Intel Xeon 3.40GHz, 4 cores per cpu
Software Environment: RHEL 5.1
Problem Description:

When i run cpu hotplug stress test on a system with 16 logical cpus,
some processes are deadlocked, it can be regenerated on 2.6.25-rc2
and 2.6.25-rc3.

The kernel thread [watchdog/1] isn't reaped by its parent when cpu #1
is set to offline, so that [kstopmachine] sleeps for ever.


Steps to reproduce:

Here is my stress test script, you may try it on SMP platforms (the more
number of cpu is ,  the easier it can be regenerated.).

########################################################################

####### stresscpus.sh #######

#!/bin/sh

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
do
        ./bugexposure.sh $i &
done


######## bugexposure.sh ###########

#!/bin/bash

main(){

    typeset -i CPU=$1
    ./task.sh > /dev/null&
    PID=$!

    if [ `cat /sys/devices/system/cpu/cpu${CPU}/online` = "0" ]; then
        echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
    fi

    MASK=$((1<<${CPU}))

    `taskset -p ${MASK} ${PID} > /dev/null 2>&1`

    echo "0" > /sys/devices/system/cpu/cpu${CPU}/online

    echo "1" > /sys/devices/system/cpu/cpu${CPU}/online

    disown $PID
    kill -9 $PID > /dev/null 2>&1

    #echo "PASS\n"

}

typeset -i TEST_CPU=$1
while true
do
        main $TEST_CPU
done


###### task.sh ######
#!/bin/bash

while :
do
  NOOP=1
done

########################################################################
Comment 1 Yi Yang 2008-03-02 22:32:35 UTC
After i investigated, debugged and analyzed it, i think it is a scheduler
specific issue, i noticed someone submitted a patch which mentioned
kthread_should_stop calling rule:

"In order to safely use kthread_stop() for kthread, there is a requirement
on how its main loop has to be orginized. Namely, the sequence of
events that lead to kthread being blocked (schedule()) has to be
ordered as follows:

    set_current_state(TASK_INTERRUPTIBLE);
    if (kthread_should_stop()) break;
    schedule() or similar.

set_current_state() implies a full memory barrier. kthread_stop()
has a matching barrier right after an update of kthread_stop_info.k
and before kthread's wakeup."

I don't think it is a good programming paradigm because it will result
in some issues very easily, why can't we provide a simple wrapper for it?

This issue seems such one, but i tried to change it to follow this rule but
the issue is still there.

Why isn't the kernel thread [watchdog/1] reaped by its parent? its state
is TASK_RUNNING with high priority (R< means this), why it isn't done?

Anyone ever met such a problem? Your thought?
Comment 2 Yi Yang 2008-03-02 22:33:17 UTC
My tracing and analysis info is below:

# ps aux | grep watchdog
root         5  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/0]
root         8  0.0  0.0      0     0 ?        R<   22:26   0:00 [watchdog/1]
root        11  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/2]
root        14  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/3]
root        17  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/4]
root        20  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/5]
root        23  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/6]
root        26  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/7]
root        29  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/8]
root        32  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/9]
root        35  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/10]
root        38  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/11]
root        41  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/12]
root        44  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/13]
root        50  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/15]
root      5641  0.0  0.0  61144   708 pts/0    R+   22:58   0:00 grep watchdog
#

# ps aux | grep ksoftirqd
root         4  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/0]
root        10  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/2]
root        13  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/3]
root        16  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/4]
root        19  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/5]
root        22  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/6]
root        25  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/7]
root        28  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/8]
root        31  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/9]
root        34  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/10]
root        37  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/11]
root        40  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/12]
root        43  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/13]
root        49  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/15]
root      5647  0.0  0.0  61144   712 pts/0    R+   23:00   0:00 grep ksoftirqd
#

#ps aux
...
root      5554  0.0  0.0      0     0 ?        S<   22:42   0:00 [kstopmachine]
root      5556  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5557  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5558  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5559  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5560  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5561  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5562  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5563  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5564  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5565  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5566  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5567  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5568  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5569  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
...


[watchdog/1] isn't terminated so that [kstopmachine] will waiting for it
for ever, so that "echo 0 > /sys/devices/syste/cpu/cpu1/online" will hang
up, so that hotplug spinlock will be holden by it, so that other spinlock
contenter will wait for locking it for ever, so that...

These are some softlock detection output:

process 5471 (task.sh) no longer affine to cpu1
INFO: task group_balance:51 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
group_balance D ffff810001011580     0    51      2
 ffff81027e97fdc0 0000000000000046 ffff81027e97fdd0 ffffffff80472b15
 0000000000000000 ffff81027e97d280 ffff81027c8bd300 ffff81027e97d5d0
 0000000080569da0 ffff81027e97d5d0 000000007e97d280 000000010009d366
Call Trace:
 [<ffffffff80472b15>] thread_return+0x3d/0xa2
 [<ffffffff8023cce5>] lock_timer_base+0x26/0x4b
 [<ffffffff8047338b>] __mutex_lock_slowpath+0x5c/0x93
 [<ffffffff80473221>] mutex_lock+0x1a/0x1e
 [<ffffffff8024f8fd>] get_online_cpus+0x27/0x3e
 [<ffffffff8022fd04>] load_balance_monitor+0x60/0x375
 [<ffffffff8022fca4>] load_balance_monitor+0x0/0x375
 [<ffffffff80245ff6>] kthread+0x47/0x75
 [<ffffffff802301f0>] schedule_tail+0x28/0x5c
 [<ffffffff8020cc78>] child_rip+0xa/0x12
 [<ffffffff80245faf>] kthread+0x0/0x75
 [<ffffffff8020cc6e>] child_rip+0x0/0x12

INFO: task bugexposure.sh:5462 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bugexposure.s D ffff8100010e1580     0  5462      1
 ffff81026a981c58 0000000000000082 ffff81026a981c68 ffffffff80472b15
 00000000000004d6 ffff81027c83c920 ffff81027e8bd680 ffff81027c83cc70
 0000000d7e8bc4c0 ffff81027c83cc70 0000000d00000002 000000010009d809
Call Trace:
 [<ffffffff80472b15>] thread_return+0x3d/0xa2
 [<ffffffff80314b25>] __next_cpu+0x19/0x28
 [<ffffffff8022a5d4>] enqueue_rt_entity+0x4e/0xb1
 [<ffffffff80472f34>] schedule_timeout+0x1e/0xad
 [<ffffffff802280bf>] enqueue_task+0x4d/0x58
 [<ffffffff80472ccf>] wait_for_common+0xd5/0x118
 [<ffffffff8022c31b>] default_wake_function+0x0/0xe
 [<ffffffff80245c93>] kthread_stop+0x58/0x79
 [<ffffffff8046fdcf>] cpu_callback+0x189/0x197
 [<ffffffff8046f59e>] cpu_callback+0x1cf/0x274
 [<ffffffff8026c07e>] writeback_set_ratelimit+0x19/0x68
 [<ffffffff8047658b>] notifier_call_chain+0x29/0x4c
 [<ffffffff8024f74d>] _cpu_down+0x1d6/0x2aa
 [<ffffffff8024f845>] cpu_down+0x24/0x31
 [<ffffffff8038fcf4>] store_online+0x29/0x67
 [<ffffffff802d23e1>] sysfs_write_file+0xd2/0x110
 [<ffffffff8028db49>] vfs_write+0xad/0x136
 [<ffffffff8028e086>] sys_write+0x45/0x6e
 [<ffffffff8020bfd9>] tracesys+0xdc/0xe1
Comment 3 Rafael J. Wysocki 2008-03-05 12:34:04 UTC
*** Bug 10169 has been marked as a duplicate of this bug. ***
Comment 4 Rafael J. Wysocki 2008-03-05 14:47:28 UTC
This entry is being used for tracking a regression from 2.6.24.  Please don't
close it until the problem is fixed in the mainline.

Handled-By : Gautham R Shenoy <ego@in.ibm.com>
Comment 6 Yi Yang 2008-03-05 17:53:17 UTC
At the beginning, i thought Detect Soft Lockups
(CONFIG_DETECT_SOFTLOCKUP) is culprit because it added much code to
detect lock dependencies and deadlocks, but the issue is still there
after i disabled soft lock detection, it can be regenerated but will
take much long time.

At this time, group_balance and kblockd are in "R<" state.

[root@harwich-rh ~]# ps aux | grep kstop
root     15466  0.0  0.0      0     0 ?        S<   01:03   0:00 [kstopmachine]
root     15467  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15468  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15469  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15470  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15471  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15472  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15473  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15474  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15475  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15476  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     17155  0.0  0.0  61148   708 pts/3    R+   14:47   0:00 grep kstop
[root@harwich-rh ~]# ps aux | grep "R<"
root        35  0.0  0.0      0     0 ?        R<   Mar13   0:00 [group_balance]
root       182  0.1  0.0      0     0 ?        R<   Mar13   1:09 [kblockd/0]
root     17157  0.0  0.0  61148   704 pts/3    R+   14:47   0:00 grep R<
[root@harwich-rh ~]# 
Comment 7 Yi Yang 2008-03-05 18:00:29 UTC
I also tried to specify higher priority for cpu_callback in kernel/softlockup.c so that it can be called firstly when cpu is set to offline, but the problem is still there, i really don't have any way to debug it. :-)
Comment 8 Rafael J. Wysocki 2008-03-09 15:58:49 UTC
Regressions list annotation:
References : http://lkml.org/lkml/2008/3/3/11
Comment 9 Rafael J. Wysocki 2008-03-09 16:01:35 UTC
Regressions list annotation:
Handled-By : Gregory Haskins <ghaskins@novell.com>
Patch : http://lkml.org/lkml/2008/3/8/4
Comment 10 Adrian Bunk 2008-03-09 17:44:35 UTC
fixed by commit 393d94d98b19089ec172566e23557997931b137e