Bug 218538 - [BISECTED][REGRESSION] Kernel 6.6 and later breaks S3 resume on SATA SSD OPAL
Summary: [BISECTED][REGRESSION] Kernel 6.6 and later breaks S3 resume on SATA SSD OPAL
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: Other Linux
: P3 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-28 18:42 UTC by desgua
Modified: 2024-03-18 18:22 UTC (History)
5 users (show)

See Also:
Kernel Version: 6.6 and later
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
bisect the bad commit (3.86 KB, text/plain)
2024-03-09 20:08 UTC, desgua
Details
Candidate fix patch v1 (6.71 KB, patch)
2024-03-18 06:45 UTC, Damien Le Moal
Details | Diff

Description desgua 2024-02-28 18:42:55 UTC
Since Kernel 6.1.64 the system can't resume from S3 suspend for SSD SATA OPAL configured with sedutil.
Comment 1 desgua 2024-02-28 18:47:05 UTC
This bug seems similar to old and fixed bug https://bugzilla.kernel.org/show_bug.cgi?id=211227
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-04 14:04:26 UTC
This bug report lacks too many details, it's unlikely that anyone will look into this. And even if: most likely you need to perform a bisection to locate the change that broke things. Here is a guide that explains how to do this: https://www.leemhuis.info/files/misc/How%20to%20bisect%20a%20Linux%20kernel%20regression%20%e2%80%94%20The%20Linux%20Kernel%20documentation.html
Comment 3 desgua 2024-03-04 16:42:11 UTC
Thank you so much Thorsten! I will try to bisect (a big problem is most of the time my computer shuts down while trying to compile the kernel because of high temperatures even using `make -j1`). Meanwhile, is there a guide on which information would be useful to post here?
Comment 4 desgua 2024-03-04 16:48:52 UTC
On Debian I was asked to post the boot logs with journalctl, which can be viewed here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1058890 

The sleep and wake up works normally for every component but for unencrypting the disk (hardware OPAL SED). 

Debian Stable 12 x86_64
Old notebook Intel Core™ i7-2720QM and Nvidia GeForce® GTX 560M, and I'm using a self encrypting SSD configured with sedutil.
I boot with the parameter libata.allow_tpm=1 and I have a systemd service to store the password in the kernel allowing it to decrypt the disk on waking up.

Problem: since linux kernel 6.1.64 (which correspond to Debian linux-image-6.1.0-14-amd64 through 15, 16, 17 and 18) the system is unable to fully wake up from suspend. Most of the time it wakes up to a black screen and CAPS LOCK led doesn't change when pressing the CAPS LOCK button. Sometimes the monitor turns on and I can login in a tty but no command ever works. Not even `reboot` `shutdown` etc. Regardless if the monitor turns on, I can shutdown with Alt + SysRq + B.


What I've tested:

1) Adding: nomodeset, init_on_alloc=0, intel_iommu=off to boot the kernel. No change at all.

2) I tested suspend by writing to /sys/power/pm_test the following successively "freezer", "devices", "platform", "processors", "core". Every single test was successful, ie, the system wakes up on its own after 5 seconds.

3) `echo mem > /sys/power/state"` instead of `systemctl suspend`. No change.

4) Changing /etc/systemd/sleep.conf to have "SuspendState=standby" or "SuspendState=freeze" seems to make things worse because the notebook seems unable to enter sleep mode, while "SuspendState=mem" causes the same behavior as reported here.

5) I recompiled sedutil after booting linux-image-6.1.0-18-amd64 (git clone --branch s3-sleep-support https://github.com/badicsalex/sedutil.git). It made no difference.
Comment 5 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-05 08:20:00 UTC
(In reply to desgua from comment #3)
> Thank you so much Thorsten! I will try to bisect (a big problem is most of
> the time my computer shuts down while trying to compile the kernel because
> of high temperatures even using `make -j1`).

Sorry, but with such a big problem that system might be unsuitable for reporting bugs, because a problem like this can lead to all sort of trouble.

> Meanwhile, is there a guide on
> which information would be useful to post here?

https://docs.kernel.org/admin-guide/reporting-issues.html has a few, but in the end it depends a lot on the problem and the subsystem which information would be useful. Note, that document will tell you that this is most likely the wrong place for the report, as the right developers might not even see it here.
Comment 6 desgua 2024-03-05 11:17:55 UTC
Thank you so much Thorsten! <3
Comment 7 desgua 2024-03-09 20:08:21 UTC
Created attachment 305971 [details]
bisect the bad commit
Comment 8 desgua 2024-03-09 20:09:43 UTC
I managed to bisect and find the bad commit. I attached a file above, and I will paste the bisect log below:

git bisect start
# status: waiting for both good and bad commits
# good: [2dde18cd1d8fac735875f2e4987f11817cc0bc2c] Linux 6.5
git bisect good 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
# status: waiting for bad commit, 1 good commit known
# bad: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6
git bisect bad ffc253263a1375a65fa6c9f62a893e9767fbebfa
# skip: [a1c19328a160c80251868dbd80066dce23d07995] Merge tag 'soc-arm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect skip a1c19328a160c80251868dbd80066dce23d07995
# good: [7c3f204e544dfa376bf1b34ebaa5552304a2b7d9] perf/smmuv3: Remove build dependency on ACPI
git bisect good 7c3f204e544dfa376bf1b34ebaa5552304a2b7d9
# good: [01a7eb3e20994701700631ec30462087c4ecf142] mm: fix clean_record_shared_mapping_range kernel-doc
git bisect good 01a7eb3e20994701700631ec30462087c4ecf142
# good: [0e72db77672ff4758a31fb5259c754a7bb229751] Merge tag 'soc-dt-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 0e72db77672ff4758a31fb5259c754a7bb229751
# good: [0e72db77672ff4758a31fb5259c754a7bb229751] Merge tag 'soc-dt-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 0e72db77672ff4758a31fb5259c754a7bb229751
# good: [0e72db77672ff4758a31fb5259c754a7bb229751] Merge tag 'soc-dt-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 0e72db77672ff4758a31fb5259c754a7bb229751
# good: [5eea5820c7340d39e56e169e1b87199391105f6b] Merge tag 'mm-stable-2023-09-04-14-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect good 5eea5820c7340d39e56e169e1b87199391105f6b
# good: [b300c0fdf0045ede109a349aa9c79f81bfae086a] Merge tag 'hwmon-for-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
git bisect good b300c0fdf0045ede109a349aa9c79f81bfae086a
# bad: [7e20d344b53532adf60d77cb41873ebdb4f80cf4] Merge tag 'x86-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 7e20d344b53532adf60d77cb41873ebdb4f80cf4
# good: [94b7ed384fa9d397ff0aabff76a8de2f7e107144] Merge tag 'for-v6.6-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply
git bisect good 94b7ed384fa9d397ff0aabff76a8de2f7e107144
# bad: [cb84fb87f325ecd46be586b62623db5b2c0a792e] Merge tag 'integrity-v6.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
git bisect bad cb84fb87f325ecd46be586b62623db5b2c0a792e
# bad: [e402b08634b398e9feb94902c7adcf05bb8ba47d] Merge tag 'soc-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad e402b08634b398e9feb94902c7adcf05bb8ba47d
# bad: [ae213639983a5406849d62d33257dfc076bc48a7] Merge tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect bad ae213639983a5406849d62d33257dfc076bc48a7
# good: [eafdc5071351314702175a3cd083cf6f7eef6488] Merge tag 'block-6.6-2023-09-28' of git://git.kernel.dk/linux
git bisect good eafdc5071351314702175a3cd083cf6f7eef6488
# bad: [95289e49f0a05f729a9ff86243c9aff4f34d4041] Merge tag 'ata-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
git bisect bad 95289e49f0a05f729a9ff86243c9aff4f34d4041
# bad: [3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567] scsi: sd: Differentiate system and runtime start/stop management
git bisect bad 3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567
# good: [3b8e0af4a7a331d1510e963b8fd77e2fca0a77f1] ata: libata-core: Fix ata_port_request_pm() locking
git bisect good 3b8e0af4a7a331d1510e963b8fd77e2fca0a77f1
# good: [84d76529c650f887f1e18caee72d6f0589e1baf9] ata: libata-core: Fix port and device removal
git bisect good 84d76529c650f887f1e18caee72d6f0589e1baf9
# good: [fb99ef17865035a6657786d4b2af11a27ba23f9b] ata: libata-scsi: link ata port and scsi device
git bisect good fb99ef17865035a6657786d4b2af11a27ba23f9b
# first bad commit: [3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567] scsi: sd: Differentiate system and runtime start/stop management
Comment 9 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-10 07:27:35 UTC
thx, great. There is one small but kinda important details missing: did you test if a recent 6.8-rc kernel is still affected? I assume you did confirm this, but would be good to be sure.
Comment 10 desgua 2024-03-10 11:52:05 UTC
I just tested 6.8-rc7 from 2024-03-03 and unfortunately it still has this regression.
Comment 11 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-11 13:27:28 UTC
Thx, forwarded by mail: https://lore.kernel.org/regressions/1b6130bc-69c5-4683-86d1-5ff631da3f80@leemhuis.info/
Comment 12 Damien Le Moal 2024-03-13 07:14:05 UTC
> The sleep and wake up works normally for every component but for
> unencrypting the disk (hardware OPAL SED). 

I think I understand what is going on here: commit 3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567 changed sd_resume to do nothing and delegate the disk resume to libata, as it should, because we cannot issue any command, even START STOP UNIT, unless the ata port and the device are first resumed. This is the issue that is likely causing what you described (black screen on resume etc).

However, the change of commit 3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567 also causes opal_unlock_from_suspend(sdkp->opal_dev) to *NOT* be called, thus leaving the drive locked after resume, so unusable.

So I think your bisect is correct and we have a bug. However, this is not trivial to fix because as mentioned above, we must first wait for the ata port and device to be resumed before attempting to access the drive. So I will need some brainstorming to come up with a fix. Give me a couple of days please (I do have SED OPAL drive so I should be able to reproduce this issue).
Comment 13 desgua 2024-03-13 11:01:02 UTC
Thank you so much Damien! <3 If there is anything I can do to help, count on me.
Comment 14 Damien Le Moal 2024-03-15 05:23:21 UTC
(In reply to desgua from comment #13)
> Thank you so much Damien! <3 If there is anything I can do to help, count on
> me.

Can you try this patch:

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bdd0acf7fa3c..5e326facc01c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1723,6 +1723,17 @@ static void sd_rescan(struct device *dev)
 {
        struct scsi_disk *sdkp = dev_get_drvdata(dev);
 
+       /*
+        * ATA drives with TCG OPAL, are not unlocked by sd_resume()
+        * as the device link may still be down when that function is called.
+        * Since libata calls scsi_rescan_device() once the device link is
+        * re-established, unlock the device here.
+        */
+       if (sdkp->need_opal_unlock_from_suspend) {
+               opal_unlock_from_suspend(sdkp->opal_dev);
+               sdkp->need_opal_unlock_from_suspend = 0;
+       }
+
        sd_revalidate_disk(sdkp->disk);
 }
 
@@ -3954,6 +3965,9 @@ static int sd_resume(struct device *dev, bool runtime)
                return 0;
 
        if (!sd_do_start_stop(sdkp->device, runtime)) {
+               /* Delay unlocking the device (see sd_rescan()) */
+               if (sdkp->opal_dev)
+                       sdkp->need_opal_unlock_from_suspend = 1;
                sdkp->suspended = false;
                return 0;
        }
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 409dda5350d1..1db872288a9d 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -151,6 +151,7 @@ struct scsi_disk {
        unsigned        urswrz : 1;
        unsigned        security : 1;
        unsigned        ignore_medium_access_errors : 1;
+       unsigned        need_opal_unlock_from_suspend : 1;
 };
 #define to_scsi_disk(obj) container_of(obj, struct scsi_disk, disk_dev)
Comment 15 desgua 2024-03-15 14:14:46 UTC
YES! I patched v6.8 and it fix the problem. I was able to suspend to S3 and resume normally.
Comment 16 Damien Le Moal 2024-03-16 01:38:08 UTC
(In reply to desgua from comment #15)
> YES! I patched v6.8 and it fix the problem. I was able to suspend to S3 and
> resume normally.

Excellent. So this confirms the root cause. However, I am not super happy with the fix as it is. I am preparing a different one with the same effect. I will ask you to test this new patch to confirm it works as well. I will send that Monday. Thanks for the testing !
Comment 17 desgua 2024-03-16 10:09:08 UTC
Thank you so much Damien!
Comment 18 desgua 2024-03-17 09:47:36 UTC
Maybe this can help: NVMe is reported to be working fine (https://github.com/Drive-Trust-Alliance/sedutil/issues/90), although I don't know if NVMe driver differentiate system and runtime start/stop management.
Comment 19 Damien Le Moal 2024-03-17 23:39:18 UTC
(In reply to desgua from comment #18)
> Maybe this can help: NVMe is reported to be working fine
> (https://github.com/Drive-Trust-Alliance/sedutil/issues/90), although I
> don't know if NVMe driver differentiate system and runtime start/stop
> management.

Thanks, but the problem is well understood and looking at the nvme driver will be of no help here. The issue comes from the mess that ATA device suspend/resume is, which is not really due to bad programming but rather to the peculiarities of AHCI/SATA device handling.

The patch I sent above confirms the root cause and what needs to be done. I am preparing a better patch as what I sent is really a hack...
Comment 20 Damien Le Moal 2024-03-18 06:45:49 UTC
Created attachment 306002 [details]
Candidate fix patch v1
Comment 21 Damien Le Moal 2024-03-18 06:46:56 UTC
(In reply to desgua from comment #17)
> Thank you so much Damien!

Can you please try the candidate fix patch v1 I attached to this case ?
Please make sure to apply it after removing the previous changes you tried.

Thanks !
Comment 22 desgua 2024-03-18 12:47:25 UTC
I ran: 

git reset --hard origin/master
git pull origin master
git checkout v6.8 
And manually checked drivers/scsi/sd.c and sd.h before applying fix patch v1.

And YES patch v1 works fine and the computer is able to resume from suspend to S3. 

Thank you so much Damien!

Note You need to log in before you can comment on or make changes to this bug.