Bug 42382

Summary: Soft-lockup during cpu-hotplug in VFS callpaths
Product: Power Management
Reporter: Srivatsa S. Bhat (srivatsa)
Component: Other
Assignee: power-management_other
Severity: normal
CC: srivatsa
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.0.1, 3.0.3
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Soft-lockup_log

Description Srivatsa S. Bhat 2011-09-05 10:10:47 UTC
Created attachment 71732

While running stressful CPU hotplug tests with a kernel compilation
running in the background, soft lockups are detected on multiple CPUs.
Sometimes this also leads to hard lockups and a kernel panic.
All the soft lockups seem to occur at vfsmount_lock_local_lock_cpu() or in
other VFS callpaths.

[37108.410813] BUG: soft lockup - CPU#5 stuck for 22s! [cc1:29669]
[37108.694781] Call Trace:
[37108.697306]  [<ffffffff81199e70>] ? vfsmount_lock_local_lock_cpu+0x70/0x70
[37108.704258]  [<ffffffff81187cb5>] path_init+0x315/0x400
[37108.709558]  [<ffffffff8127c398>] ? __raw_spin_lock_init+0x38/0x70
[37108.715812]  [<ffffffff8118961c>] path_openat+0x8c/0x3f0
[37108.721203]  [<ffffffff81012129>] ? sched_clock+0x9/0x10
[37108.726597]  [<ffffffff8109416d>] ? sched_clock_cpu+0xcd/0x110
[37108.732508]  [<ffffffff810a178d>] ? trace_hardirqs_off+0xd/0x10
[37108.738498]  [<ffffffff8109421f>] ? local_clock+0x6f/0x80
[37108.743970]  [<ffffffff81189a99>] do_filp_open+0x49/0xa0
[37108.749362]  [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
[37108.754665]  [<ffffffff8152584b>] ? _raw_spin_unlock+0x2b/0x40
[37108.760575]  [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
[37108.765875]  [<ffffffff81179607>] do_sys_open+0x107/0x1e0
[37108.771352]  [<ffffffff810d610f>] ? audit_syscall_entry+0x1bf/0x1f0
[37108.777695]  [<ffffffff81179720>] sys_open+0x20/0x30
[37108.782741]  [<ffffffff8152e202>] system_call_fastpath+0x16/0x1b

Hardware: Dual socket quad-core hyper-threaded Intel x86 machine

Steps:

(a) Stressful cpu hotplug tests + kernel compilation

(b) IRQ balancing had been disabled and all the IRQs were made to be
    routed to CPU 0 (except the ones that couldn't be routed).

(c) Lockdep was enabled during kernel configuration.

Steps (b) and (c) were done to dig deeper into the issue. However, the same
issue was observed with step (a) alone.
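
For anyone trying to reproduce this, steps (a) and (b) could be sketched
roughly as below. This is not the test script used for this report; it is a
minimal illustration that assumes the standard sysfs/procfs control files
(/sys/devices/system/cpu/cpuN/online and /proc/irq/N/smp_affinity). The
function names and the sysfs_root/proc_root parameters are made up for this
example, and the paths are parameters only so the logic can be tried on a
scratch directory. On a real machine it needs root, and CPU 0 often has no
"online" file on x86, so it is skipped automatically.

```python
import os
import random
import time


def hotpluggable_cpus(sysfs_root="/sys/devices/system/cpu"):
    """Return the CPU numbers that expose an 'online' control file."""
    cpus = []
    for name in os.listdir(sysfs_root):
        suffix = name[3:]
        if name.startswith("cpu") and suffix.isdigit():
            if os.path.exists(os.path.join(sysfs_root, name, "online")):
                cpus.append(int(suffix))
    return sorted(cpus)


def set_cpu_online(cpu, online, sysfs_root="/sys/devices/system/cpu"):
    """Write 1 or 0 to cpuN/online (requires root on a real system)."""
    path = os.path.join(sysfs_root, "cpu%d" % cpu, "online")
    with open(path, "w") as f:
        f.write("1" if online else "0")


def route_irq_to_cpu0(irq, proc_root="/proc/irq"):
    """Try to pin one IRQ to CPU 0; return False if it cannot be routed."""
    try:
        with open(os.path.join(proc_root, str(irq), "smp_affinity"), "w") as f:
            f.write("1")  # hex bitmask: only bit 0 set, i.e. CPU 0
        return True
    except OSError:
        return False


def stress_hotplug(iterations, sysfs_root="/sys/devices/system/cpu",
                   delay=0.0, seed=None):
    """Repeatedly offline and re-online a randomly chosen CPU.

    Returns the number of online/offline writes performed.
    """
    rng = random.Random(seed)
    cpus = hotpluggable_cpus(sysfs_root)
    if not cpus:
        return 0
    toggles = 0
    for _ in range(iterations):
        cpu = rng.choice(cpus)
        set_cpu_online(cpu, False, sysfs_root)
        time.sleep(delay)
        set_cpu_online(cpu, True, sysfs_root)
        toggles += 2
    return toggles
```

Running stress_hotplug() in a loop while a kernel build (for example
"make -j") runs in another shell approximates step (a); calling
route_irq_to_cpu0() for each IRQ directory under /proc/irq, with the
irqbalance service stopped, approximates step (b).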

There definitely seems to be a race condition here, because the issue is
hit only some time after starting the tests, and the time it takes to hit
the issue increases as the number of debug print statements increases. In
some cases (especially when the number of debug print statements was quite
high), the stress on the machine had to be increased in order to hit the
issue within a reasonable time. In my tests, at most about 2 to 2.5 hours
was sufficient to hit this bug.

Comment 1 Srivatsa S. Bhat 2011-09-06 08:07:52 UTC

*** This bug has been marked as a duplicate of bug 42402 ***