Bug 42382
| Summary: | Soft-lockup during cpu-hotplug in VFS callpaths | | |
|---|---|---|---|
| Product: | Power Management | Reporter: | Srivatsa S. Bhat (srivatsa) |
| Component: | Other | Assignee: | power-management_other |
| Status: | CLOSED DUPLICATE | | |
| Severity: | normal | CC: | srivatsa |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.0.1, 3.0.3 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Attachments: | Soft-lockup_log | | |
Created attachment 71732
Soft-lockup_log

While running stressful CPU hotplug tests with a kernel compilation running in the background, soft lockups are detected on multiple CPUs. Sometimes this also leads to hard lockups and a kernel panic. All the soft lockups seem to occur at vfsmount_lock_local_lock_cpu() or in other VFS callpaths.

[37108.410813] BUG: soft lockup - CPU#5 stuck for 22s! [cc1:29669]
<snip>
[37108.694781] Call Trace:
[37108.697306] [<ffffffff81199e70>] ? vfsmount_lock_local_lock_cpu+0x70/0x70
[37108.704258] [<ffffffff81187cb5>] path_init+0x315/0x400
[37108.709558] [<ffffffff8127c398>] ? __raw_spin_lock_init+0x38/0x70
[37108.715812] [<ffffffff8118961c>] path_openat+0x8c/0x3f0
[37108.721203] [<ffffffff81012129>] ? sched_clock+0x9/0x10
[37108.726597] [<ffffffff8109416d>] ? sched_clock_cpu+0xcd/0x110
[37108.732508] [<ffffffff810a178d>] ? trace_hardirqs_off+0xd/0x10
[37108.738498] [<ffffffff8109421f>] ? local_clock+0x6f/0x80
[37108.743970] [<ffffffff81189a99>] do_filp_open+0x49/0xa0
[37108.749362] [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
[37108.754665] [<ffffffff8152584b>] ? _raw_spin_unlock+0x2b/0x40
[37108.760575] [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
[37108.765875] [<ffffffff81179607>] do_sys_open+0x107/0x1e0
[37108.771352] [<ffffffff810d610f>] ? audit_syscall_entry+0x1bf/0x1f0
[37108.777695] [<ffffffff81179720>] sys_open+0x20/0x30
[37108.782741] [<ffffffff8152e202>] system_call_fastpath+0x16/0x1b

Hardware: Dual-socket, quad-core, hyper-threaded Intel x86 machine

Scenario:
(a) Stressful CPU hotplug tests + kernel compilation
(b) IRQ balancing was disabled and all IRQs were routed to CPU 0 (except the ones that couldn't be routed).
(c) Lockdep was enabled during kernel configuration.

Steps (b) and (c) were done to dig deeper into the issue. However, the same issue was observed by just doing step (a).
There appears to be a race condition here, because the issue is hit only some time after the tests are started, and the time it takes to hit the issue increases as the number of debug print statements increases. In some cases (especially when the number of debug print statements was quite high), the stress on the machine had to be increased in order to hit the issue within a reasonable time. In my tests, the bug was hit within about 2 to 2.5 hours at most.
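
The report does not include the hotplug test script itself. As a rough illustration of what the hotplug half of step (a) might look like, the sketch below repeatedly takes randomly chosen CPUs offline and brings them back online through the standard sysfs `online` files under /sys/devices/system/cpu. The function names, the random CPU selection, and the delay parameter are assumptions made for illustration, not the reporter's actual test. Running it against the real sysfs tree requires root.

```python
import os
import random
import time

SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


def hotpluggable_cpus(sysfs_root=SYSFS_CPU_ROOT):
    """Return the sorted CPU numbers that expose an 'online' control file.

    CPUs without that file (often CPU 0 on x86) cannot be hotplugged
    and are skipped.
    """
    cpus = []
    for name in os.listdir(sysfs_root):
        # Match only cpuN directories, not cpufreq, cpuidle, etc.
        if name.startswith("cpu") and name[3:].isdigit():
            if os.path.exists(os.path.join(sysfs_root, name, "online")):
                cpus.append(int(name[3:]))
    return sorted(cpus)


def set_online(cpu, online, sysfs_root=SYSFS_CPU_ROOT):
    """Write 1 (online) or 0 (offline) to the CPU's sysfs control file."""
    path = os.path.join(sysfs_root, "cpu%d" % cpu, "online")
    with open(path, "w") as f:
        f.write("1" if online else "0")


def stress_hotplug(iterations, sysfs_root=SYSFS_CPU_ROOT, delay=0.1, rng=None):
    """Offline and re-online a random CPU 'iterations' times.

    Returns the number of offline/online cycles performed (0 if no CPU
    is hotpluggable).
    """
    rng = rng or random.Random()
    cpus = hotpluggable_cpus(sysfs_root)
    if not cpus:
        return 0
    cycles = 0
    for _ in range(iterations):
        cpu = rng.choice(cpus)
        set_online(cpu, False, sysfs_root)
        time.sleep(delay)
        set_online(cpu, True, sysfs_root)
        time.sleep(delay)
        cycles += 1
    return cycles
```

A call such as `stress_hotplug(10000, delay=0.05)` run as root while a kernel build is in progress would approximate the scenario. The iteration count and delay here are arbitrary; the report does not state the values used.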