Bug 151631 - "Synchronizing SCSI cache" fails during(and delays) reboot/shutdown
Summary: "Synchronizing SCSI cache" fails during(and delays) reboot/shutdown
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: linux-scsi@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-06 17:17 UTC by elvis has left the building
Modified: 2024-01-22 08:40 UTC (History)
6 users (show)

See Also:
Kernel Version: <=4.7.0-g0cbbc42 4.7.0-g52ddb7e
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
for systemd, /usr/lib/systemd/system-shutdown/debug.sh (1.50 KB, text/plain)
2016-08-06 19:57 UTC, elvis has left the building
Details
Patch reverting commit 2c85025c75dfe7ddc2bb33363a998dad59383f94 (1.56 KB, patch)
2016-11-08 08:11 UTC, Gianpaolo
Details | Diff

Description elvis has left the building 2016-08-06 17:17:15 UTC
During reboot/shutdown,
here's a screenshot with the errors: https://i.imgur.com/3P7ztlj.png

I'm assuming the ohci-pci errors are unrelated, otherwise, the error is(manually written here):
sd 3:0:0:0: [sda] Synchronizing SCSI cache
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x40d0000 action 0x6 frozen
ata4.00: SError: { PHYRdyChg CommWake 10B8B DevExch }
ata4.00: failed command: FLUSH CACHE EXT
ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 22
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4.00: status { DRDY }
ata4: hard resetting link
ata4: softreset failed (1st FIS failed)
ata4: hard resetting link
...

After 2 minutes or so, the reboot/shutdown continues okay.

I've tried hdparm -F /dev/sda and hdparm -f /dev/sda during normal system function, they both seem to work as there are no errors reported on console/dmesg/journalctl.
This seems to tell me that Synchronizing SCSI cache should work too, unless who knows why they aren't working?maybe something shutdown some laptop parts(like the SSD itself) before reaching that.


Some info about what ata4 is:

[    4.728132] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    4.730133] ata4.00: supports DRM functions and may not be fully accessible
[    4.730603] ata4.00: NCQ Send/Recv Log not supported
[    4.730606] ata4.00: ATA-9: Samsung SSD 840 EVO 1TB, EXT0BB6Q, max UDMA/133
[    4.730609] ata4.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    4.730897] ata4.00: supports DRM functions and may not be fully accessible
[    4.731400] ata4.00: NCQ Send/Recv Log not supported
[    4.731408] ata4.00: configured for UDMA/133
[    4.733053] scsi 3:0:0:0: Direct-Access     ATA      Samsung SSD 840  BB6Q PQ: 0 ANSI: 5

Note this might've worked before so this is a regression(or I just didn't see or have these errors turned on?) with kernel: 4.7.0-rc6-ga99cde4
But with both kernels I am/was experiencing the "Unmounting /oldroot" stalling/delay which is here: https://bugzilla.kernel.org/show_bug.cgi?id=151621
but that message happens before(and it's systemd calling umount2 syscall) this issue you're reading now.

file: drivers/scsi/sd.c
in function: sd_shutdown
hangs in call: sd_sync_cache(sdkp);

Cheers
Comment 1 elvis has left the building 2016-08-06 19:20:57 UTC
I tried one shutdown, and after the above it did shutdown, as expected, however, on poweron, SSD was not detected by BIOS unless I did one ctrl+alt+del from POST's attempt to boot from LAN, and this second time it was detected. This happens usually after I powerdown from button(hold 4 sec) because I imagine SSD is busy recovering from sudden-power-loss internally, so I can only conclude that the kernel didn't safely powerdown the SSD(aside from the above failure to flush cache). (Was definitely ok in 4.7.0-rc6-ga99cde4)

I did however have a debug.sh script in /usr/lib/systemd/system-shutdown/debug.sh
which did these:
sync && sdparm --command=sync /dev/sda && sleep 1
mount -o remount,ro /
hdparm -F /dev/sda
hdparm -f /dev/sda
sleep 1

So in a way, that ssd cache was supposedly flushed regardless; and it doesn't seem to me that something else ever tried to remount,rw or do writes after this.
Comment 2 elvis has left the building 2016-08-06 19:57:37 UTC
Created attachment 227801 [details]
for systemd, /usr/lib/systemd/system-shutdown/debug.sh

Found workaround, thanks to this article: https://unix.stackexchange.com/questions/55281/how-to-stop-waking-all-attached-drives-on-reboot-deactivating-swap/55417#55417

If I put the following in aforementioned debug.sh file then reboot is instant(after the SCSI bug, of course):

#THIS:
#turn off drive cache
hdparm -W0 /dev/sda
#^^^^^^^^^^^^^^^^^^^^^^

#this was already here before:
#flush drive cache:
hdparm -F /dev/sda
hdparm -f /dev/sda
Comment 3 elvis has left the building 2016-08-07 00:29:48 UTC
Ok, now that the cache error doesn't happen and reboot works, I've tested shutdown and it now fails at the "Stopping disk" part, which means this line:
sd_start_stop_device(sdkp, 0);

ata4.00: failed command: STANDBY IMMEDIATE

screenshot: https://i.imgur.com/2yvAMSR.jpg

So now we know that it did in fact not safely shutdown the drive before, when I thought it's just the drive cache flushing that failed. So the drive is doing recovery, internally, upon laptop poweron which is why it doesn't identify itself in time but only on next ctrl+alt+del boot. (see comment #1)
Comment 4 elvis has left the building 2016-08-07 15:06:22 UTC
I have worked around this issue(for me) by adding a few lines in debug.sh like:

if test "$1" == "poweroff"; then
 hdparm -Y /dev/sda
 echo o > /proc/sysrq-trigger ; sleep 5
fi

that puts drive to sleep and triggers shutdown via sysrq

Seems to have worked just fine: no drive issues on startup anymore, as I've mentioned before with having to ctrl+alt+del once to get BIOS to detect drive; and no extra delays.

Also the above require a kernel patch:
//sd_start_stop_device(sdkp, 0);
commenting out that call in sd_shutdown function in file: drivers/scsi/sd.c
or else it will fail like in comment #3
Comment 5 Gianpaolo 2016-11-08 08:11:27 UTC
Created attachment 243891 [details]
Patch reverting commit 2c85025c75dfe7ddc2bb33363a998dad59383f94

This patch solved bug https://bugzilla.kernel.org/show_bug.cgi?id=187061 which I suspect being a duuplicate of this bug. If someone affected by this bug could test it and confirm it works (it should work both with v4.8.6 and v4.9-rc4).
Comment 6 Daniele Viganò 2016-11-10 06:54:16 UTC
I can confirm exactly the same bug on my Dell Latitude E5450 with an SSD (Samsung 840) running Fedora 24 and kernel 4.8.6 (4.7.9 was fine).

The patch provided by Gianpaolo solved the issue on my machine.
Comment 7 Rich 2016-12-04 20:22:03 UTC
Very similar (if not same) problem happening on my works PC - a Dell Precision T1700. I will try and get a screenshot and add it to this thread.

The problem began with Kernel 4.8.6 and is still present (currently running Kernel 4.8.11).

Very annoying on shutdown. I have absolutely no idea if the disks' cache is actually flushing or not - so data integrity is a concern. Its a magnetic disk and I have been running fsck on it regularly (no issues found as of yet) and been avoiding shutting the PC down.

Point of this comment: Its still an issue on some PCs. Anyone else finding the same?
Comment 8 Daniele Viganò 2016-12-04 20:26:32 UTC
@Rich, have a look at https://bugzilla.kernel.org/show_bug.cgi?id=187061

Bug has been resolved it 4.9-rc7.
Comment 9 Vyacheslav Dikonov 2017-01-17 20:29:24 UTC
I am experiencing the Synchronize Cache problem with both 4.4.43 and 4.8.17.
My dmesg yields:

[  428.593067] sd 5:0:0:0: [sdd] Synchronizing SCSI cache
[  428.716643] sd 5:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

When I try to attach an external USB3 drive (over an USB2 interface). It work well for the first time, but later, after I unmount and disconnect it, any subsequent attempt to attacj the same hdd results in this bug. The disk cannot be mounted for the second time. Fs is reiserfs. It happens with kernel 4.4.43 too,
Comment 10 Vyacheslav Dikonov 2017-01-17 20:56:20 UTC
One further observation. The drive was attached to an USB2 controller 

00:1d.0 USB controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller

through an USB hub, which has  both USB2 and USB3 ports. The bug occurs when I try to reattach the hdd to the same type of ports. The HDD works when I change the type of ports (USB2 > USB3 or USB3 > USB2). It happens again if I do not change the type of connection. It might be a different USB-related problem, but the error message is the same "Synchronize Cache(10) failed"
Comment 11 Jesse Stone 2024-01-04 11:45:55 UTC
I met this problem too, before reboot do follow command
echo "1" >  /sys/block/[sdx]/device/delete

remove device can reboot good! test ok
Comment 12 Jesse Stone 2024-01-22 08:40:40 UTC
//if under linux 4.12 (include 4.12) add this patch
void __scsi_remove_device(struct scsi_device *sdev)
{
    struct device *dev = &sdev->sdev_gendev;
    int res;
    int ret;

    /*
     * This cleanup path is not reentrant and while it is impossible
     * to get a new reference with scsi_device_get() someone can still
     * hold a previously acquired one.
     */
    if (sdev->sdev_state == SDEV_DEL)
        return;

    if (sdev->is_visible) {
        /*
         * If scsi_internal_target_block() is running concurrently,
         * wait until it has finished before changing the device state.
         */
        mutex_lock(&sdev->state_mutex);
        /*
         * If blocked, we go straight to DEL and restart the queue so
         * any commands issued during driver shutdown (like sync
         * cache) are errored immediately.
         */
        res = scsi_device_set_state(sdev, SDEV_CANCEL);
        //if (res != 0) {
        ret = scsi_device_set_state(sdev, SDEV_DEL);
        if ( !ret )
        {
            scsi_start_queue(sdev);
        }
        //}
        mutex_unlock(&sdev->state_mutex);

        if (res != 0)
            return;

        bsg_unregister_queue(sdev->request_queue);
        device_unregister(&sdev->sdev_dev);
        transport_remove_device(dev);
        scsi_dh_remove_device(sdev);
        device_del(dev);
    } else {
        put_device(&sdev->sdev_dev);
    }

    /*
     * Stop accepting new requests and wait until all queuecommand() and
     * scsi_run_queue() invocations have finished before tearing down the
     * device.
     */
    mutex_lock(&sdev->state_mutex);
    scsi_device_set_state(sdev, SDEV_DEL);
    mutex_unlock(&sdev->state_mutex);

    blk_cleanup_queue(sdev->request_queue);
    cancel_work_sync(&sdev->requeue_work);

    if (sdev->host->hostt->slave_destroy)
    {
        sdev->host->hostt->slave_destroy(sdev);
    }

    transport_destroy_device(dev);

    /*
     * Paired with the kref_get() in scsi_sysfs_initialize().  We have
     * remoed sysfs visibility from the device, so make the target
     * invisible if this was the last device underneath it.
     */
    scsi_target_reap(scsi_target(sdev));

    put_device(dev);
}

//if above linux 4.13 (include 4.13)
void __scsi_remove_device(struct scsi_device *sdev)
{
	struct device *dev = &sdev->sdev_gendev;
	int res;
#ifdef MY_PATCH	
	int ret;
#endif	

	/*
	 * This cleanup path is not reentrant and while it is impossible
	 * to get a new reference with scsi_device_get() someone can still
	 * hold a previously acquired one.
	 */
	if (sdev->sdev_state == SDEV_DEL)
		return;

	if (sdev->is_visible) {
		/*
		 * If scsi_internal_target_block() is running concurrently,
		 * wait until it has finished before changing the device state.
		 */
		mutex_lock(&sdev->state_mutex);
		/*
		 * If blocked, we go straight to DEL and restart the queue so
		 * any commands issued during driver shutdown (like sync
		 * cache) are errored immediately.
		 */
		res = scsi_device_set_state(sdev, SDEV_CANCEL);
#ifdef MY_PATCH
		ret = scsi_device_set_state(sdev, SDEV_DEL);
		if ( !ret )
			scsi_start_queue(sdev);
#else
		if (res != 0) {
			res = scsi_device_set_state(sdev, SDEV_DEL);
			if (res == 0)
				scsi_start_queue(sdev);
		}
#endif		
		mutex_unlock(&sdev->state_mutex);

		if (res != 0)
			return;

		if (sdev->host->hostt->sdev_groups)
			sysfs_remove_groups(&sdev->sdev_gendev.kobj,
					sdev->host->hostt->sdev_groups);

		bsg_unregister_queue(sdev->request_queue);
		device_unregister(&sdev->sdev_dev);
		transport_remove_device(dev);
		device_del(dev);
	} else
		put_device(&sdev->sdev_dev);

	/*
	 * Stop accepting new requests and wait until all queuecommand() and
	 * scsi_run_queue() invocations have finished before tearing down the
	 * device.
	 */
	mutex_lock(&sdev->state_mutex);
	scsi_device_set_state(sdev, SDEV_DEL);
	mutex_unlock(&sdev->state_mutex);

	blk_cleanup_queue(sdev->request_queue);
	cancel_work_sync(&sdev->requeue_work);

	if (sdev->host->hostt->slave_destroy)
		sdev->host->hostt->slave_destroy(sdev);
	transport_destroy_device(dev);

	/*
	 * Paired with the kref_get() in scsi_sysfs_initialize().  We have
	 * removed sysfs visibility from the device, so make the target
	 * invisible if this was the last device underneath it.
	 */
	scsi_target_reap(scsi_target(sdev));

	put_device(dev);
}

Note You need to log in before you can comment on or make changes to this bug.