The Linux kernel does not handle device_shutdown() for the libata buses. The
bus has no .shutdown method. The devices have no .shutdown handlers.
Therefore, device caches are not flushed prior to a system restart, system sleep
or system powerdown, and device heads are not unloaded before a system sleep or
powerdown, be it a transition to ACPI S3, S4 or S5 state, APM sleep/poweroff, or
non-PM-poweroff (the one that just asks the user to powerdown the machine).
* Potential data loss if userspace does not manage to flush caches using the
* The need for the userspace halt(8) utility to try to avoid the above
mentioned data loss (incidentally, the lack of a proper high-level kernel
interface to do device cache flushes or device head unloads also means the
userspace code to do it is disgusting)
* Unneeded device wear (disk head assembly) if userspace or ACPI BIOS firmware
does not unload heads prior to power off.
The disk's "head auto-park" on power off is actually called "emergency head
unload" on modern disk drives by the HD manufacturers for a reason, and wears
down the drive mechanics a lot more, causing the drive's lifetime to be
drastically reduced. Hitachi documentation for laptop HDs mentions an emergency
unload as being from 20 to 100 times more stressful to the drive mechanics, and
makes it clear it should be avoided.
Note that it is important to avoid unloading disk heads without a reason
(especially on non-laptop drives), so the shutdown handler must differentiate a
system restart (which should just sync caches) from a system power off (which
needs to sync caches and also to unload heads).
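A minimal sketch of the distinction described above, written as a plain Python model rather than kernel code. The names `Transition` and `shutdown_actions` are hypothetical and exist only for illustration; the point is that every transition flushes the cache, while only a restart skips the head unload.

```python
from enum import Enum, auto


class Transition(Enum):
    """System transitions named in this report (hypothetical model)."""
    RESTART = auto()
    HALT = auto()
    POWER_OFF = auto()
    SUSPEND_TO_RAM = auto()   # ACPI S3
    SUSPEND_TO_DISK = auto()  # ACPI S4


def shutdown_actions(transition):
    """Return the ordered list of actions a disk shutdown handler should take."""
    # The write cache is flushed on every transition, restart included.
    actions = ["flush_cache"]
    # Heads are unloaded only when power is about to be removed, so a
    # restart does not add a needless load/unload cycle.
    if transition is not Transition.RESTART:
        actions.append("unload_heads")
    return actions
```

For example, `shutdown_actions(Transition.RESTART)` yields only the cache flush, while `shutdown_actions(Transition.POWER_OFF)` yields the flush followed by the head unload.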
The fact that userspace will likely try to unload heads if it can before a
shutdown means that either the kernel needs to snoop the few ATA commands that
cause head unloads and spin-downs, to avoid reissuing such commands needlessly,
or it needs to be easy for userspace to detect that the running kernel is one
that can handle disk device shutdowns by itself. Otherwise, during a system
shutdown sequence, the HDs are likely to be spun down (due to userspace
command), only to be spun up and down again when the kernel attempts to shutdown
Also note that it is basically impossible for userspace to cover up for the lack
of disk device shutdown handling by the kernel for ACPI S4 (suspend-to-disk),
and that userspace code may fail to do it even for the normal shutdown if it
doesn't find the devices, or doesn't know how to do it or that it should do it
(e.g. Debian Sarge cannot properly shutdown libata devices).
Steps to reproduce:
* Patch halt(8) or the shutdown scripts to not attempt to issue commands to
park heads and sync caches, and listen closely to the amount of noise the HD
will make when the system powers down. Note: this could conceivably kill the
HD, and even if it doesn't, it will most surely reduce its working lifetime as
much as pulling the plug with the machine running would.
* Try to suspend a ThinkPad T43 to disk using ACPI S4 (other models might also
have the same issue, but I have personally verified that a T43 BIOS does not
unload disk heads for ACPI S4).
Cache *is* synchronized prior to shutting down. It happens via sd->shutdown
issuing SCSI SYNCHRONIZE_CACHE which gets translated into ATA_CMD_FLUSH[_EXT] as
appropriate. Heads are unloaded properly on sleep to memory too.
IIRC, sd_shutdown used to issue SCSI START_STOP command to stop (unload) drives.
The command is translated into ATA_STANDBY which unloads the head. However,
this behavior changed because of multipath SCSI devices. SCSI devices can be
connected to multiple hosts and unloading heads when one host is going down
causes problems to other hosts. I'll think about how to solve this.
So, THERE IS *NO* DATA LOSS. CACHE IS *ALWAYS* SYNCHRONIZED BEFORE SHUTTING DOWN.
I am *very* happy to hear this, and I apologise for the "lack of sync cache"
misreport. I screwed up and paid too much attention to the libata side of
things, and failed to check the scsi infrastructure properly, thus missing sd.c.
I can also definitely verify that indeed heads are being unloaded on S3. I
assumed the ACPI BIOS was doing this, as S4 and S5 are not being properly
handled [from a non-multipath device point of view]... another one to apologise
for, I should never "assume" anything.
The other points still stand, though: it would be nice to special case commands
that unload heads/spin down devices/detach(ATA sleep) devices, so that we avoid
spinning up devices to issue these commands, or worse, doing full EH and an ATA
bus reset to wake up a detached/sleeping drive only to issue an unload head/spin
down/sleep command to it...
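One way to picture the special-casing asked for above is a small per-drive tracker that snoops the power-management commands passing through and reports when another spin-down would be redundant. This is only a sketch under stated assumptions: the class and method names are hypothetical, the opcode values are the standard ATA ones, and the model conservatively treats every other command as waking the drive. It cannot see the drive's own standby timer.

```python
# Standard ATA power-management opcodes.
ATA_CMD_STANDBYNOW1 = 0xE0  # STANDBY IMMEDIATE
ATA_CMD_STANDBY = 0xE2      # STANDBY
ATA_CMD_CHK_POWER = 0xE5    # CHECK POWER MODE (does not wake the drive)
ATA_CMD_SLEEP = 0xE6        # SLEEP

SPIN_DOWN_CMDS = {ATA_CMD_STANDBYNOW1, ATA_CMD_STANDBY, ATA_CMD_SLEEP}


class DrivePowerTracker:
    """Hypothetical model of snooping spin-down commands for one drive."""

    def __init__(self):
        self.state = "active"

    def should_issue(self, cmd):
        """Return False when issuing cmd would only wake the drive needlessly."""
        if cmd not in SPIN_DOWN_CMDS:
            return True
        if self.state == "sleep":
            # Waking a sleeping drive needs a reset; never do that just
            # to put it back down again.
            return False
        if self.state == "standby":
            # Already spun down; only a transition to SLEEP still changes state.
            return cmd == ATA_CMD_SLEEP
        return True

    def note_issued(self, cmd):
        """Record the effect of a command that was actually sent to the drive."""
        if cmd == ATA_CMD_SLEEP:
            self.state = "sleep"
        elif cmd in (ATA_CMD_STANDBYNOW1, ATA_CMD_STANDBY):
            self.state = "standby"
        elif cmd != ATA_CMD_CHK_POWER:
            # Assumption: any other command brings the drive back to active.
            self.state = "active"
```

With this model, a STANDBY IMMEDIATE sent by userspace during halt would cause a later kernel-side spin-down request to be skipped instead of spinning the disk up and down again.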
Also, doing a head unload on reboot would not be nice, as minimizing head
unloads will increase device lifespan (especially on desktop drives), therefore
whatever is done to fix the need for head unloads on poweroff issue should not
cause head unloads or platter spindown on system restart, just on power off/halt.
Indeed multipath causes several problems for the above. I'd say the only way to
deal with multipath sanely would be to always assume a single path by default
for the purposes of the above problems, and to have userspace interfaces capable
of telling the kernel that a device is multipath, all the local paths (so that
the kernel knows that all disc-specific device state for these paths is shared,
even if bus states are not), and whether there are external paths or not.
For all I know, most of this is already done by the multipath suite (I have no
experience with multipath). Still, for the issues related to this bug report,
multipath devices without external paths should probably have all the handling
of a single path device, except that the kernel would need to avoid issuing the
same set of commands to the device over the other local paths.
It would be also EXTREMELY nice to know from userspace whether we need to
attempt to issue cache flushes and disk standby commands on the halt(8) command
or not (because the running kernel will do it). I would *very* much like to
remove the utter gross crap in halt.c that deals with it in two or three
years... is there a way to query the kernel, directly or indirectly, for this
My memory of sd stopping disks seems to be fabricated. I'm probably
confusing it with a discussion thread on linux-scsi. Anyway, I'm attaching a
patch. It's against 2.6.20-rc5 but should also work with 2.6.19. The
following are added by the patch.
stop_on_shutdown_default defaults to zero and stop_on_shutdown's are initialized
from it. To enable stop_on_shutdown globally, you first set
stop_on_shutdown_default to 1 then set all present stop_on_shutdown to 1.
If stop_on_shutdown for a disk is 1, sd will issue SCSI START_STOP command with
START_VALID and START == 0 which tells the disk to stop on shutdown except for
restart. As before, SYNCHRONIZE_CACHE is issued unconditionally which does not
usually cause spinup on a spun down device.
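The command sequence described above can be sketched at the CDB level. This is an illustrative model, not the patch itself: `shutdown_cdbs` is a hypothetical helper, and it assumes SYNCHRONIZE CACHE (10) and the 6-byte START STOP UNIT command, where the START bit is bit 0 of byte 4.

```python
SYNCHRONIZE_CACHE = 0x35  # SCSI SYNCHRONIZE CACHE (10) opcode
START_STOP = 0x1B         # SCSI START STOP UNIT opcode


def shutdown_cdbs(stop_on_shutdown, restarting):
    """Return the CDBs a disk would be sent at shutdown, in order."""
    # The cache flush is unconditional: a 10-byte CDB with all fields
    # zero asks the device to flush its whole cache.
    cdbs = [bytes([SYNCHRONIZE_CACHE] + [0] * 9)]
    # The stop is sent only when enabled for this disk and the system
    # is not restarting.
    if stop_on_shutdown and not restarting:
        # Byte 4, bit 0 is START; leaving it clear requests a stop.
        cdbs.append(bytes([START_STOP, 0, 0, 0, 0x00, 0]))
    return cdbs
```

Calling `shutdown_cdbs(True, False)` returns the flush followed by the stop, while `shutdown_cdbs(True, True)` and `shutdown_cdbs(False, False)` return the flush alone.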
Would this be enough? If you ACK this, I'll try to push this to mainline. Thanks.
Created attachment 10122
Yes, it is enough. I would still prefer to have
/sys/module/sd_mod/parameters/stop_on_shutdown_default set to 1 by default, as
multipath is the less common setup (it is, in fact, rare when compared with
non-multipath setups now that scsi also means SATA, and soon, PATA), and
stop_on_shutdown_default=1 is the correct configuration for anything *but*
multi-host multipath setups.
But we can certainly set /sys/module/sd_mod/parameters/stop_on_shutdown_default
properly in userspace, and it is *much* better than what we have now.
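A rough sketch of what that userspace step could look like at boot, and of how halt(8) could detect a kernel that carries the patch. The module parameter path is the one quoted above. The per-disk attribute location `class/scsi_disk/*/stop_on_shutdown` is an assumption, since the thread does not spell it out, and both function names are hypothetical. The `sysfs_root` argument exists so the logic can be tried against a scratch directory instead of the real `/sys`.

```python
from pathlib import Path

# Path quoted in this thread, relative to the sysfs mount point.
DEFAULT_PARAM = "module/sd_mod/parameters/stop_on_shutdown_default"
# Assumed location of the per-disk attribute (not stated in the thread).
PER_DISK_GLOB = "class/scsi_disk/*/stop_on_shutdown"


def kernel_handles_disk_stop(sysfs_root="/sys"):
    """Return True if the running kernel exposes the stop_on_shutdown knob."""
    return (Path(sysfs_root) / DEFAULT_PARAM).is_file()


def enable_stop_on_shutdown(sysfs_root="/sys"):
    """Set the default first, then every disk already present.

    Returns False without writing anything if the kernel lacks the knob.
    """
    root = Path(sysfs_root)
    default = root / DEFAULT_PARAM
    if not default.is_file():
        return False
    # Disks probed later inherit this default.
    default.write_text("1\n")
    # Disks that already exist were initialized from the old default.
    for attr in sorted(root.glob(PER_DISK_GLOB)):
        attr.write_text("1\n")
    return True
```

A halt implementation could call `kernel_handles_disk_stop()` and skip its own standby commands when it returns True.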
Acked-by: Henrique de Moraes Holschuh <firstname.lastname@example.org>
BTW: Thank you very much for the fast response to this bug :-)
You're welcome. :-)
I agree that it would be nicer to have 1 as the default 'default' value but I'm
afraid we can't do that as that would change kernel behavior which userland
might be relying on. Well, it's not like configuring this during boot is difficult.
I'll submit the patch. Thanks.
Oh.. BTW, s/multipath/multi initiator/.
One question: What happened to this patch? I have a T60 here (AHCI) and needed
to apply this patch manually to be able to switch the device off cleanly
(without the SMART values disk_shift and power-off_retract_count increasing).
SCSI part is committed into mainline. libata part is still pending. This can
appear in mainline 2.6.22 at the earliest.
Thanks for your quick answer. Hope I don't bother you if I ask: I use
linux-2.6.21-rc6-git4 here and had to apply the scsi part manually. When will
this patch be applied to the source code (I know that I don't know the exact
procedure for patch inclusion - sorry for that)
It's in the scsi-misc-2.6 tree that the SCSI maintainer maintains. It was too late
for 2.6.21 inclusion and will be merged into mainline during the 2.6.22-rc1 cycle.
This issue has been dealt with for a long time now. Any remaining fallout is being tracked separately on its own bug entries, so I am marking this bug closed.