Problem Description:

The Linux kernel does not handle device_shutdown() for the libata buses. The bus has no .shutdown method, and the devices have no .shutdown handlers. Therefore, device caches are not flushed prior to a system restart, system sleep or system powerdown, and device heads are not unloaded before a system sleep or powerdown, be it a transition to ACPI S3, S4 or S5 state, APM sleep/poweroff, or non-PM poweroff (the one that just asks the user to power down the machine).

This causes:

* Potential data loss if userspace does not manage to flush caches using the passthrough.
* The need for the userspace halt(8) utility to try to avoid the above-mentioned data loss (incidentally, the lack of a proper high-level kernel interface to do device cache flushes or device head unloads also means the userspace code to do it is disgusting).
* Unneeded device wear (disk head assembly) if userspace or ACPI BIOS firmware does not unload heads prior to power off. The disk's "head auto-park" on power off is actually called "emergency head unload" on modern disk drives by the HD manufacturers for a reason: it wears down the drive mechanics a lot more, drastically reducing the drive's lifetime. Hitachi documentation for laptop HDs mentions an emergency unload as being from 20 to 100 times more stressful to the drive mechanics, and makes it clear it should be avoided.

Note that it is important to avoid unloading disk heads without a reason (especially on non-laptop drives), so the shutdown handler must differentiate a system restart (which should just sync caches) from a system power off (which needs to sync caches and also to unload heads).

The fact that userspace will likely try to unload heads if it can before a shutdown means that either the kernel needs to snoop the few ATA commands that cause head unloads and spin-downs, to avoid reissuing such commands needlessly, or it needs to be easy for userspace to detect that the running kernel is one that can handle disk device shutdowns by itself. Otherwise, during a system shutdown sequence, the HDs are likely to be spun down (due to the userspace command), only to be spun up and down again when the kernel attempts to shut down the device.

Also note that it is basically impossible for userspace to cover up for the lack of disk device shutdown handling by the kernel for ACPI S4 (suspend-to-disk), and that userspace code may fail to do it even for the normal shutdown if it doesn't find the devices, or doesn't know how to do it or that it should do it (e.g. Debian Sarge cannot properly shut down libata devices).

Steps to reproduce:

* Patch halt(8) or the shutdown scripts to not attempt to issue commands to park heads and sync caches, and listen closely to the amount of noise the HD makes when the system powers down. Note: this could conceivably kill the HD, and even if it doesn't, it will most surely reduce its working lifetime as much as pulling the plug with the machine running would.
* Try to suspend a ThinkPad T43 to disk using ACPI S4 (other models might also have the same issue, but I have personally verified that a T43 BIOS does not unload disk heads for ACPI S4).
Cache *is* synchronized prior to shutting down. It happens via sd->shutdown issuing SCSI SYNCHRONIZE_CACHE which gets translated into ATA_CMD_FLUSH[_EXT] as appropriate. Heads are unloaded properly on sleep to memory too. IIRC, sd_shutdown used to issue SCSI START_STOP command to stop (unload) drives. The command is translated into ATA_STANDBY which unloads the head. However, this behavior changed because of multipath SCSI devices. SCSI devices can be connected to multiple hosts and unloading heads when one host is going down causes problems to other hosts. I'll think about how to solve this. So, THERE IS *NO* DATA LOSS. CACHE IS *ALWAYS* SYNCHRONIZED BEFORE SHUTTING DOWN.
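The translation described in this comment can be pictured with a small userspace sketch. This is not libata's code: xlat_scsi_to_ata() and has_flush_ext are hypothetical names, and mapping a stop request to STANDBY IMMEDIATE (0xE0) is an assumption, since the comment only says "ATA_STANDBY". The opcode values are the standard SCSI and ATA ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* SCSI opcodes. */
#define SCSI_SYNCHRONIZE_CACHE 0x35 /* SYNCHRONIZE CACHE (10) */
#define SCSI_START_STOP        0x1B /* START STOP UNIT */

/* ATA opcodes. */
#define ATA_FLUSH_CACHE        0xE7
#define ATA_FLUSH_CACHE_EXT    0xEA
#define ATA_STANDBY_IMMEDIATE  0xE0 /* assumed target of a "stop" request */

/*
 * Hypothetical helper: return the ATA opcode a SCSI CDB would map to,
 * or -1 if this sketch does not handle it. has_flush_ext says whether
 * the drive supports FLUSH CACHE EXT.
 */
int xlat_scsi_to_ata(const uint8_t *cdb, bool has_flush_ext)
{
    switch (cdb[0]) {
    case SCSI_SYNCHRONIZE_CACHE:
        return has_flush_ext ? ATA_FLUSH_CACHE_EXT : ATA_FLUSH_CACHE;
    case SCSI_START_STOP:
        /* Bit 0 of byte 4 is START; clear means "stop the unit". */
        if ((cdb[4] & 0x01) == 0)
            return ATA_STANDBY_IMMEDIATE;
        return -1; /* starting a unit is out of scope here */
    default:
        return -1;
    }
}
```

Only the stop direction is modelled, because that is the one relevant to powering off.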
I am *very* happy to hear this, and I apologise for the "lack of sync cache" misreport. I screwed up: I paid too much attention to the libata side of things and failed to check the scsi infrastructure properly, thus missing sd.c. I can also definitely verify that heads are indeed being unloaded on S3. I assumed the ACPI BIOS was doing this, as S4 and S5 are not being properly handled [from a non-multipath device point of view]... another one to apologise for, I should never "assume" anything.

The other points still stand, though: it would be nice to special-case commands that unload heads/spin down devices/detach (ATA sleep) devices, so that we avoid spinning up devices to issue these commands, or worse, doing full EH and an ATA bus reset to wake up a detached/sleeping drive only to issue an unload head/spin down/sleep command to it...

Also, doing a head unload on reboot would not be nice, as minimizing head unloads will increase device lifespan (especially on desktop drives). Therefore, whatever is done to fix the need for head unloads on power off should not cause head unloads or platter spindown on system restart, just on power off/halt.

Indeed, multipath causes several problems for the above. I'd say the only way to deal with multipath sanely would be to always assume a single path by default for the purposes of the above problems, and to have userspace interfaces capable of telling the kernel that a device is multipath, what all the local paths are (so that the kernel knows that all disk-specific device state for these paths is shared, even if bus states are not), and whether there are external paths or not. For all I know, most of this is already done by the multipath suite (I have no experience with multipath).

Still, for the issues related to this bug report, multipath devices without external paths should probably get all the handling of a single-path device, except that the kernel would need to avoid issuing the same set of commands to the device over the other local paths.

It would also be EXTREMELY nice to know from userspace whether we need to attempt to issue cache flushes and disk standby commands in halt(8) or not (because the running kernel will do it). I would *very* much like to remove the utter gross crap in halt.c that deals with it in two or three years... is there a way to query the kernel, directly or indirectly, for this capability?
My memory of sd stopping disks seems to be fabricated. I'm probably confusing it with a discussion thread on linux-scsi. Anyway, I'm attaching a patch. It's against 2.6.20-rc5 but should also work with 2.6.19. The patch adds the following:

* /sys/module/sd_mod/parameters/stop_on_shutdown_default
* /sys/class/scsi_disk/h:c:i:l/stop_on_shutdown

stop_on_shutdown_default defaults to zero, and each disk's stop_on_shutdown is initialized from it. To enable stop_on_shutdown globally, first set stop_on_shutdown_default to 1, then set stop_on_shutdown to 1 for all disks already present. If stop_on_shutdown for a disk is 1, sd will issue a SCSI START_STOP command with START_VALID and START == 0, which tells the disk to stop, on every shutdown except a restart. As before, SYNCHRONIZE_CACHE is issued unconditionally; it does not usually cause a spinup on a spun-down device.

Would this be enough? If you ACK this, I'll try to push it to mainline. Thanks.
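As a rough illustration of the behaviour described above (not the code from the attached patch), the stop request and the "except for restart" rule might look like this in plain C. The enum shutdown_kind and both function names are hypothetical; the CDB layout (opcode 0x1B, POWER CONDITION in bits 7..4 of byte 4, START in bit 0) follows the SCSI block command set.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SCSI_START_STOP 0x1B /* START STOP UNIT opcode */

/* Hypothetical stand-in for the kernel's idea of why we are going down. */
enum shutdown_kind { SHUTDOWN_RESTART, SHUTDOWN_HALT, SHUTDOWN_POWER_OFF };

/* Fill a 6-byte START STOP UNIT CDB that asks the disk to stop. */
void build_stop_cdb(uint8_t cdb[6])
{
    memset(cdb, 0, 6);
    cdb[0] = SCSI_START_STOP;
    /*
     * Byte 4: POWER CONDITION = 0 (START_VALID) in bits 7..4 means the
     * START bit is honoured; START = 0 in bit 0 means "stop".
     */
    cdb[4] = 0x00;
}

/* Stop the disk only if the per-disk flag is set and this is not a restart. */
bool should_stop_disk(enum shutdown_kind kind, bool stop_on_shutdown)
{
    return stop_on_shutdown && kind != SHUTDOWN_RESTART;
}
```

In this sketch the cache flush is deliberately left out of should_stop_disk(), mirroring the statement that SYNCHRONIZE_CACHE is issued unconditionally.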
Created attachment 10122: implement-sd-stop-on-shutdown
Yes, it is enough. I would still prefer to have /sys/module/sd_mod/parameters/stop_on_shutdown_default set to 1 by default, as multipath is the less common setup (it is, in fact, rare when compared with non-multipath setups now that scsi also means SATA, and soon, PATA), and stop_on_shutdown_default=1 is the correct configuration for anything *but* multi-host multipath setups. But we can certainly set /sys/module/sd_mod/parameters/stop_on_shutdown_default properly in userspace, and it is *far* better than what we have now.

Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
BTW: Thank you very much for the fast response to this bug :-)
You're welcome. :-) I agree that it would be nicer to have 1 as the default 'default' value but I'm afraid we can't do that as that would change kernel behavior which userland might be relying on. Well, it's not like configuring this during boot is difficult. I'll submit the patch. Thanks.
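For what configuring this at boot could look like, here is a minimal sketch that follows the order given in the patch description: write the module default first, then every per-disk attribute already present. The two paths are the ones named in the patch description and may not exist on a given kernel; write_flag() and enable_stop_on_shutdown() are hypothetical helper names.

```c
#include <glob.h>
#include <stdio.h>

/* Paths as named in the patch description; assumed, not verified here. */
#define DEFAULT_PARAM "/sys/module/sd_mod/parameters/stop_on_shutdown_default"
#define PER_DISK_GLOB "/sys/class/scsi_disk/*/stop_on_shutdown"

/* Write "1" or "0" to one attribute file. Returns 0 on success, -1 on error. */
int write_flag(const char *path, int on)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(on ? "1\n" : "0\n", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Set the default, then every attribute matching the pattern.
 * Returns the number of per-disk attributes updated, or -1 if the
 * default could not be written.
 */
int enable_stop_on_shutdown(const char *default_path, const char *pattern)
{
    glob_t g;
    int updated = 0;

    if (write_flag(default_path, 1) != 0)
        return -1;
    if (glob(pattern, 0, NULL, &g) == 0) {
        for (size_t i = 0; i < g.gl_pathc; i++)
            if (write_flag(g.gl_pathv[i], 1) == 0)
                updated++;
        globfree(&g);
    }
    return updated;
}
```

A boot script equivalent would call enable_stop_on_shutdown(DEFAULT_PARAM, PER_DISK_GLOB) once, as root, after the disks have been probed.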
Oh.. BTW, s/multipath/multi initiator/.
One question: what happened to this patch? I have a T60 here (AHCI) and needed to apply this patch manually to be able to switch the device off cleanly (without the SMART values disk_shift and power-off_retract_count increasing).
SCSI part is committed into mainline. libata part is still pending. This can appear in mainline 2.6.22 at the earliest.
Thanks for your quick answer. Hope I don't bother you if I ask: I use linux-2.6.21-rc6-git4 here and had to apply the scsi part manually. When will this patch be applied to the source code (I know that I don't know the exact procedure for patch inclusion - sorry for that)
It's in the scsi-misc-2.6 tree, which the SCSI maintainer maintains. It was too late for 2.6.21 inclusion and will be merged into mainline during the 2.6.22-rc1 cycle.
This issue has been dealt with for a long time now. Any remaining fallout is being tracked separately on its own bug entries, so I am marking this bug closed.