Bug 42981
Summary: | Processor Aggregator Device is not stable causing FW-OS communication to stop | | |
---|---|---|---|
Product: | ACPI | Reporter: | Sebastian Jarosz (sebastian.jarosz) |
Component: | Power-Processor | Assignee: | Len Brown (lenb) |
Status: | CLOSED CODE_FIX | | |
Severity: | high | CC: | alan, florian, lenb, stuart_hayes |
Priority: | P1 | | |
Hardware: | x86-64 | | |
OS: | Linux | | |
Kernel Version: | 2.6.32-131.0.15.el6.x86_64, 3.1.4+ | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | patch vs 3.5-rc2 | | |
Description
Sebastian Jarosz
2012-03-23 17:01:38 UTC
In drivers/acpi/acpi_pad.c, the isolated_cpus_lock mutex is held by destroy_power_saving_task(), which calls kthread_stop() on each power_saving thread. kthread_stop() waits for the thread to end, but the power_saving thread tries to take the isolated_cpus_lock mutex in round_robin_cpu(). If any of the power_saving threads tries to take this mutex after destroy_power_saving_task() starts killing the threads, there is a deadlock: kthread_stop() is waiting for the thread to end, and the thread is waiting to get the mutex.

I would suggest (and have tested with an older kernel) creating a new mutex, round_robin_cpus_lock, and changing round_robin_cpu() to use that mutex instead of isolated_cpus_lock. It fixed the issue, and I didn't see the point of having round_robin_cpu() use the same lock as the other functions which use isolated_cpus_lock.

(Please ignore comment #2... it was an accidental repost of comment #1.)

How to reproduce this issue:
First load the acpi_pad driver. acpi_pad binds to ACPI000C, the processor aggregator device.
If you have one of those, the driver will load and you'll see ACPI000C in sysfs with an "idlecpus" attribute beneath it:
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/ACPI000C:00/idlecpus
If you don't have ACPI000C, then hack acpi_pad.c to replace it with another PNP ID that is present in the DSDT but has no driver bound. Here I used PNP0C14 (and I removed acpi_wmi from my kernel, as it would bind to PNP0C14):
/sys/devices/LNXSYSTM:00/device:00/PNP0C14:00/idlecpus
cd to the sysfs directory with this acpi_pad attribute. Here we are running on a system with 32 logical processors, so we'll repeatedly take 31 off-line, then attempt to re-enable them all at once by asking for 0 off-line. But first we reduce the round-robin time, to make the failure more likely:
# echo 1 > rrtime
# echo 31 > idlecpus; echo 0 > idlecpus
# echo 31 > idlecpus; echo 0 > idlecpus
# echo 31 > idlecpus; echo 0 > idlecpus
(it usually takes only a few attempts)
etc., until the echo does not return. Subsequent writes to idlecpus will also hang.
# rmmod acpi_pad
will now hang.
# ps -ef | grep power_saving
will show a bunch of hung power_saving threads.
The only way to clear this condition, and to regain acpi_pad's ability to force CPUs idle, is to reboot.

Created attachment 73661 [details]
patch vs 3.5-rc2
patch from Stuart Hayes, as applied.
A patch referencing this bug report has been merged in Linux v3.5-rc5:
commit 5f1601261050251a5ca293378b492a69d590dacb
Author: Stuart Hayes <Stuart_Hayes@Dell.com>
Date: Wed Jun 13 16:10:45 2012 -0500
acpi_pad: fix power_saving thread deadlock
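
For readers who want to see the locking pattern in isolation, here is a minimal userspace sketch of the lock split suggested in this report. It is a pthreads analogue written for illustration, not the kernel patch: the stop flag plus pthread_join() stand in for kthread_stop(), and the names round_robin_lock, should_stop, power_saving_worker() and run_lock_split_demo() are assumptions made for this example.

```c
/* Userspace analogue of the lock split discussed in this report.
 * Illustrative only: names and structure are assumptions, not kernel code. */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_WORKERS 4

/* Held by the stopper for the whole teardown, like destroy_power_saving_task(). */
static pthread_mutex_t isolated_cpus_lock = PTHREAD_MUTEX_INITIALIZER;
/* Separate lock used only by the workers' round-robin step. */
static pthread_mutex_t round_robin_lock = PTHREAD_MUTEX_INITIALIZER;

static atomic_bool should_stop;   /* stands in for kthread_should_stop() */
static atomic_int started;        /* number of workers that have begun running */
static long rounds;               /* protected by round_robin_lock */

static void *power_saving_worker(void *arg)
{
    (void)arg;
    atomic_fetch_add(&started, 1);
    while (!atomic_load(&should_stop)) {
        /* If this took isolated_cpus_lock instead, the worker could block
         * here while the stopper holds that lock and waits in pthread_join(),
         * which is the deadlock described in the report. */
        pthread_mutex_lock(&round_robin_lock);
        rounds++;
        pthread_mutex_unlock(&round_robin_lock);
        sched_yield();
    }
    return NULL;
}

/* Starts the workers, then stops them while holding isolated_cpus_lock.
 * Returns 0 if every worker was created and joined, -1 otherwise. */
int run_lock_split_demo(void)
{
    pthread_t workers[NUM_WORKERS];
    int created = 0;
    int rc = 0;

    atomic_store(&should_stop, false);
    atomic_store(&started, 0);

    for (int i = 0; i < NUM_WORKERS; i++) {
        if (pthread_create(&workers[i], NULL, power_saving_worker, NULL) != 0) {
            rc = -1;
            break;
        }
        created++;
    }

    /* Make sure every created worker is actually running before teardown. */
    while (atomic_load(&started) < created)
        sched_yield();

    pthread_mutex_lock(&isolated_cpus_lock);
    atomic_store(&should_stop, true);
    for (int i = 0; i < created; i++) {
        if (pthread_join(workers[i], NULL) != 0)
            rc = -1;
    }
    pthread_mutex_unlock(&isolated_cpus_lock);

    assert(atomic_load(&started) == created);
    return rc;
}
```

Calling run_lock_split_demo() from a main() and building with something like `cc -pthread` should return promptly, because the workers never need the lock that the stopper holds while it waits for them. Changing the worker to take isolated_cpus_lock instead would let the join hang in the same way the report describes.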