Kernel Bug Tracker – Bug 9772
2.6.24-rc8 + patches: CPU hot removal while CPU is online leaves system in bad state
Last modified: 2008-07-16 15:41:45 UTC
Latest working kernel version: Unknown
Earliest failing kernel version: Unknown
Hardware Environment: x86_64
I've applied the patches attached to bug 2884 in kernel bugzilla so that I can work around the deadlock that occurs when you write to the eject node for an ACPI object.
The kernel doesn't properly handle the case of ejecting a CPU that is still online. Doing so leaves the CPU online. It still shows up in /proc/cpuinfo. And the /sys/devices/system/cpuX node still exists. But the kernel goes ahead and calls the ACPI eject method anyway, which means the hardware is free to be removed at that point. And it cleans up the /sys nodes for the ACPI tree, so it's not possible to request another ejection. This leads to system instability as things gradually hang as they attempt to interact with the CPU that has gone away.
Writing to the "eject" node for an online CPU should probably result in the ejection request being ignored because the CPU is still online. The /sys/ tree node for the ACPI device should be left intact, so another ejection can be requested after the CPU has been taken offline. The write to the eject sys node should probably fail with an error, but that's optional.
Alternatively, the kernel could automatically offline an online CPU before ejecting it. I think it's a matter of taste which behavior you prefer.
acpi_processor_remove looks like it returns -EINVAL if the CPU is online (see acpi_processor_handle_eject), but I don't think this return value is ever looked at, so the eject request isn't ever stopped. At least as far as I can see.
Steps to reproduce:
echo 1 > /sys/devices/LNXSYSTM:00/device:00/ACPI0007:01/eject while CPU is still online. Watch your shell hang. Wait a few minutes and watch the whole system hang.
I've been looking at this area too, as I'm starting to investigate CPU hotplug on ia64.
I'm not really sure what to make of the check in acpi_processor_handle_eject() that returns -EINVAL if the CPU is online. Given the fact that Documentation/cpu-hotplug.txt talks about receiving uevents to drive offlining, and the fact that the implementation of acpi_processor_hotplug_notify() really doesn't do much other than emit uevents, it seems that we have a chicken and egg problem here.
Ugh, it also seems that acpi_processor_handle_eject() calls arch_unregister_cpu(), which removes the /sys/devices/system/cpu/cpuX/online attribute, so a udev script can't do the right thing.
One possible solution would be for acpi_processor_hotplug_notify() to poke a finger into the base cpu driver and call cpu_down() when it receives an eject request, but that would negate the whole point of emitting uevents and letting userspace take care of it.
The other possibility is to move the call to arch_unregister_cpu into the cpu_down path. This still leaves open the question of what to do with acpi_unmap_lsapic(), but it might help untangle this mess.
In order to hot remove a CPU,
Please echo 0 > /sys/devices/system/cpu/cpuX/online first before echo 1 > /sys/bus/acpi/drivers/processor/ACPI0007:0X/eject.
Is this work for you?
(In reply to comment #1)
> Ugh, it also seems that acpi_processor_handle_eject() calls
> arch_unregister_cpu(), which removes the /sys/devices/system/cpu/cpuX/online
> attribute, so a udev script can't do the right thing.
IMO, for cpu hot removal, it should work like this:
users prepare to hot removal a CPU, notification ACPI_NOTIFY_EJECT_REQUEST is generated. acpi_processor_hotplug_notify send a KOBJ_OFFLINE uevent.
Userspace should echo 0 > /sys/devices/system/.../cpuX/online to offline the processor, and then echo 1 > /sys/bus/acpi/.../ACPI0007:0x/eject to remove the processor from Linux device tree, including the acpi dev and sysdev.
> One possible solution would be for acpi_processor_hotplug_notify() to poke a
> finger into the base cpu driver and call cpu_down() when it receives an eject
> request, but that would negate the whole point of emitting uevents and
> letting userspace take care of it.
IMO, it works, but you're right that user space should take care of this.
> The other possibility is to move the call to arch_unregister_cpu into the
> cpu_down path.
Then what about the acpi device node for processor? maybe we need a platform hook here?
There's a couple of problems.
/sys/devices/system/cpu/cpuX/online doesn't always correspond to /sys/bus/acpi/drivers/processor/ACPI0007:0X. Especially if you remove and then re-add CPUS, you can get a situation where CPU3 corresponds to ACPI0007:08, or something similar. I sent mail to lkml and linux-acpi describing the issue, but nobody responded. So you can't always offline the correct CPU before ejecting it.
In my testing, I'm always careful to offline the CPU before ejecting it.
It just seems really bad to make it this easy to accidentally hang the whole system. Ejecting a CPU really needs to fail if the CPU is online, or else it needs to automatically offline the CPU. Failing if the CPU is online seems to make the most sense.
Hrm, this is odd. I don't seem to have an eject file in sysfs:
[root@canola ACPI0007:02]# pwd
[root@canola ACPI0007:02]# ls -F
bus@ driver@ hid modalias path subsystem@ uevent
[root@canola ACPI0007:02]# uname -a
Linux canola.achiang 2.6.23-0.224.rc9.git6 #1 SMP Wed Oct 17 13:32:26 EDT 2007 ia64 ia64 ia64 GNU/Linux
I do have both cpu hotplug and acpi cpu hotplug kconfig'ed in...
Does your firmware have an _EJ0 method for the processor objects? You should be able to tell by dumping the DSDT or whatever.
(In reply to comment #5)
> Does your firmware have an _EJ0 method for the processor objects? You should
> be able to tell by dumping the DSDT or whatever.
Yes. the "eject" attribute is created for the devices that have a _EJ0 methods.
And IMO, for a processor that supports hot-plug,
the _EJ0 method is mandatory, isn't it?
Please attach the acpidump output.
the latest pmtools can be downloaded here:
(In reply to comment #3)
> There's a couple of problems.
So it really works for you, right?
ie, "echo 0 > online" first and "echo 1 > eject" then can hot remove a processor successfully.
> /sys/devices/system/cpu/cpuX/online doesn't always correspond to
> /sys/bus/acpi/drivers/processor/ACPI0007:0X. Especially if you remove and then
> re-add CPUS, you can get a situation where CPU3 corresponds to ACPI0007:08, or
> something similar.
Yes. Hmm, can we create a sysfs link here, from cpu sysdev to acpi device node?
> Ejecting a CPU really needs to fail if the CPU is online, or else it
> needs to automatically offline the CPU. Failing if the CPU is online seems > to make the most sense.
Yes. you're right. I'll look into this.
Yes, it works if I offline the CPU first, then echo 1 into the eject node.
I think what you need is a symlink from the ACPI device node to the CPU sysdev. When you get the eject request, you're given the ACPI device node, and can't figure out what the CPU number is (unless they just happen to match.)
Created attachment 14618 [details]
this patch forces offline the cpu if it's still online when poking the "eject" attribute.
Daniel, if the patch works, you don't need to poke the "online" attribute before echo 1 > eject. would you please give it a try. :)
Created attachment 14619 [details]
this patch creates a sysfs link from the acpi device node to sysdev node.
Both patches seem to work as advertised. Thanks!
I still need to do some stress testing, but things now look a lot better than they did.
I'm not sure what the protocol here is. Do I mark the bug as fixed with CODE_FIX, or do I not do that until the patch is actually in the kernel?
I take it back. The force-offline-processor patch doesn't seem to be working quite right yet.
The processor goes offline, but the kernel never seems to invoke the _EJ0 method for the CPU object, and the ACPI0007:01 sysnode tree isn't cleaned up. Invoking the eject method again causes the command to complete, but my bash shell locks up. I'm able to start a reboot via control-alt-delete, but I get messages on the console from init that say that various pids seem to hang, and the reboot process hangs. I have to power cycle the machine.
With the other patch, I can at least fix up my /sbin/hotplug script so that it always offlines the correct CPU before issuing the eject though, which is nice!
(In reply to comment #12)
> The processor goes offline, but the kernel never seems to invoke the _EJ0
> method for the CPU object, and the ACPI0007:01 sysnode tree isn't cleaned up.
Wreid, can you attach the dmesg output and "tree ACPI0007:01" after eject the processor for the first time?
So the behavior is strange.
If I power on with 2 CPUs, then echo 1 > .../ACPI0007:01/eject with CPU 1 online, things work.
If I power on with 4 CPUs, then I can remove cpus 3 and 2 via echo 1 > ACPI0007:0X/eject with the CPUs online, and things work. But after doing this, I try to eject CPU 1 with it online via echo 1 > .../ACPI0007:01/eject, and nothing happens, aside from CPU 1 going offline.
Here's what I see in dmesg:
CPU 3 is now offline
CPU 2 is now offline
CPU 1 is now offline
SMP alternatives: switching to UP code
My SLES 10 OS doesn't have a "tree" command in root's PATH. I can try to find the source for tree and compile it if that's really necessary. Here's what ls -l ACPI0007:01 shows:
lrwxrwxrwx 1 root root 0 Jan 30 09:06 bus -> ../../../../bus/acpi
--w------- 1 root root 0 Jan 30 09:07 eject
-r--r--r-- 1 root root 4096 Jan 30 09:06 hid
-r--r--r-- 1 root root 4096 Jan 30 09:06 modalias
-r--r--r-- 1 root root 4096 Jan 30 09:06 path
drwxr-xr-x 2 root root 0 Jan 30 09:06 power
lrwxrwxrwx 1 root root 0 Jan 30 09:06 subsystem -> ../../../../bus/acpi
lrwxrwxrwx 1 root root 0 Jan 30 09:06 sysdev -> ../../../../devices/system/cpu/cpu1
-rw-r--r-- 1 root root 4096 Jan 30 09:06 uevent
Onlining CPU 1 once I'm in this state doesn't work.
(In reply to comment #14)
> I try to eject CPU 1 with it online via echo 1 > .../ACPI0007:01/eject, and
> nothing happens, aside from CPU 1 going offline.
> Here's what I see in dmesg:
> CPU 3 is now offline
> CPU 2 is now offline
> CPU 1 is now offline
> SMP alternatives: switching to UP code
what can you see if you disable CPU1 with only two CPUs enabled?
> My SLES 10 OS doesn't have a "tree" command in root's PATH. I can try to find
> the source for tree and compile it if that's really necessary. Here's what ls
no, you don't need to do this. the following info is enough. :)
> -l ACPI0007:01 shows:
> total 0
> lrwxrwxrwx 1 root root 0 Jan 30 09:06 bus -> ../../../../bus/acpi
> --w------- 1 root root 0 Jan 30 09:07 eject
> -r--r--r-- 1 root root 4096 Jan 30 09:06 hid
> -r--r--r-- 1 root root 4096 Jan 30 09:06 modalias
> -r--r--r-- 1 root root 4096 Jan 30 09:06 path
> drwxr-xr-x 2 root root 0 Jan 30 09:06 power
> lrwxrwxrwx 1 root root 0 Jan 30 09:06 subsystem -> ../../../../bus/acpi
> lrwxrwxrwx 1 root root 0 Jan 30 09:06 sysdev ->
> -rw-r--r-- 1 root root 4096 Jan 30 09:06 uevent
device_unregister() is not executed correctly?
please try the debug patch I attached, so that we can find out why these sys attributes still exist.
Please echo 0x04 > /sys/module/acpi/parameters/debug_layer and
echo 0x8800001f > /sys/module/acpi/parameters/debug_level
before eject the cpu and attach the dmesg output
Created attachment 14701 [details]
I was able to get one hang to happen, but I applied your debug patch, turned on ACPI debugging in my kernel config, and now I can't get the failure to happen at all.
Dmesg output after the failed ejection is a bit insightful, though this was from before I had debugging turned on or the additional information:
events/0 R running task 0 7 2
events/1 D ffffffff805ff778 0 8 2
ffff81000144fbe0 0000000000000046 ffff81000fa26800 ffff81000ffd6040
ffff81000ffd6040 ffffffff8022f327 ffff81000fa26800 ffff8100015317c0
000000000000000f 0000000000000007 ffff81000fa26800 0000000000000001
Maybe I just didn't wait long enough, and it would have eventually completed?
I'll update if I can get this problem to reproduce.
Oh, one other thing.
processor_device_array[pr->id] = NULL;
processors[pr->id] = NULL;
or else acpi_processor_start goes bad in the buggy BIOS check if you remove a CPU then add one back in.
I was able to get the hang to occur. It's intermittent, but seems to happen more when I've got 2 CPUs than when I have more, and it definitely doesn't happen all the time. Maybe it has to do with one of the threads running on CPU 1, and not migrating to CPU 0 when it gets offlined, or something?
Here's the dmesg output. It doesn't look super useful:
IA-32 Microcode Update Driver: v1.14a <email@example.com>
CPU 3 is now offline
CPU 2 is now offline
CPU 1 is now offline
SMP alternatives: switching to UP code
The kernel thread is stuck waiting with the same stack I reported up in comment #17.
Created attachment 14969 [details]
Patch: hot remove processor in a kernel thread instead of work struct
please try this patch on top of the 01-patch-force-offline-proceesor and see if it helps.
but the problem is that Len has dropped this patch set as Alex and Thomas are working on something "simpler"...
So we'd better add them in the cc list, :)
The latest patch looks pretty good to me. I was able to do about 20 hot removes/hot adds before I got bored.
I can help out with testing the simpler solution when patches are available.
Created attachment 15558 [details]
Created attachment 15559 [details]
Created attachment 15560 [details]
Created attachment 15561 [details]
Please try this patch set on a clean 2.6.25 kernel. :)
Tested these new set of patches from Zhang on a 64bit kernel on top of 2.6.25-rc8.
For 32bit kernels tried these on 2.6.25-rc8-x86 tree as the 32bit ACPI boot code is fixed in the X86 tree with the smpboot integration patches submitted by Glauber.
I was able to eject CPUs from the system and it worked as expected.
Please let me know if you need any more help on this to puch them upstream.
Created attachment 15682 [details]
fix a typo in patch 03-force-offline-cpu
IMO, patch in comment #22 #23 and #28 are ready for in 2.6.26. :)
For the patch in comment #25, we don't need this patch if the patch 03-force-offline-cpu is applied because we don't need this sysfs link to poke the "online" attribute under cpu sysdev node before poking the "eject" file.
Alok, what do you think?
(In reply to comment #29)
> IMO, patch in comment #22 #23 and #28 are ready for in 2.6.26. :)
> For the patch in comment #25, we don't need this patch if the patch
> 03-force-offline-cpu is applied because we don't need this sysfs link to poke
> the "online" attribute under cpu sysdev node before poking the "eject" file.
> Alok, what do you think?
IMO it would useful when we are re-adding the cpu to the system. In that case we can have scripts in userspace to bring the cpu online as soon as that cpu is added to the system.
Now we can have a similar feature in kernel which implicitly brings the cpu online as soon as its added, but then I think we loose the flexibility for the user to decide if he wants the cpu online as soon as its added to the system.
So IMHO lets keep that patch, unless anybody disagrees.
I'm concerned that if the access to the eject file completes
before the actual eject, that it may be possible for user-space
to get confused. While this may be a little academic, it sounds
like the kind of thing where there will be a race.
Simpler for the user if the eject is guaranteed to be complete
when the access to the file returns.
So instead of deferring the actual eject so the return can complete,
how about if we execute the eject and defer removing the file until
the user access is complete? That way when we return to the user,
we are guaranteed that the eject is done.
I made a double check and found that we can't defer removing the file until the user access is complete.
As ejecting a device requires to remove the ACPI device node from Linux device tree, the "eject" file must be removed before calling device_unregister, because it is an attribute of the device node.
So I still think the proposed patches works. and they can be shipped in acpi-test for now.
patch in comment #22 applied
patch in comment #23, 28, 25 applied to acpi-test
re: patch in comment #23:
dos2unix: converting file 9772-23.patch to UNIX format ...
WARNING: Use #include <linux/signal.h> instead of <asm/signal.h>
#26: FILE: drivers/acpi/scan.c:9:
ERROR: Don't use kernel_thread(): see Documentation/feature-removal-schedule.txt
#117: FILE: drivers/acpi/scan.c:174:
+ ret = kernel_thread(acpi_bus_hot_remove_device,
patch in comment #22 shipped in linux-2.6.25-git16
Created attachment 16300 [details]
Fixes some checkpatch.pl warnings introduced by patch in comment #23
(In reply to comment #33)
> patch in comment #22 applied
> patch in comment #23, 28, 25 applied to acpi-test
> re: patch in comment #23:
> dos2unix: converting file 9772-23.patch to UNIX format ...
> WARNING: Use #include <linux/signal.h> instead of <asm/signal.h>
> #26: FILE: drivers/acpi/scan.c:9:
> +#include <asm/signal.h>
> ERROR: Don't use kernel_thread(): see
> #117: FILE: drivers/acpi/scan.c:174:
> + ret = kernel_thread(acpi_bus_hot_remove_device,
While going through the mainline tree i noticed that these patches 23, 28, 25
are still not in the mainline. Was wondering if this checkpatch warning the
reason for holding the patches back, or is there some other issue with the patches ?
I have cooked up a patch to fix these warnings with checkpatch, in comment #35. Please have a look. If there is any other issue with it, please do let
len, i think they can be shipped to acpi-test first.
> patch in comment #23, 28, 25 applied to acpi-test
These patches were checked into acpi-test on April 28th.
They were not low risk enough to qualify for a fast track into
the 2.6.26 merge window, which closed on may 3rd.
Thus they have always been targeted for 2.6.27.
they have been present in acpi-test, linux-next
and every mm release since then:
you may have pulled acpi-test at a time when it was being
re-based and the patches above (and others) where thus
temporarily not present.
Thanks for the patch in comment #35.
I've applied it to this branch and we will propose it
for the 2.6.27 merge window.
fixed in Linus' tree