Since Kernel 6.1.64 the system can't resume from S3 suspend for SSD SATA OPAL configured with sedutil.
This bug seems similar to old and fixed bug https://bugzilla.kernel.org/show_bug.cgi?id=211227
This bug report lacks too many details, it's unlikely that anyone will look into this. And even if: most likely you need to perform a bisection to locate the change that broke things. Here is a guide that explains how to do this: https://www.leemhuis.info/files/misc/How%20to%20bisect%20a%20Linux%20kernel%20regression%20%e2%80%94%20The%20Linux%20Kernel%20documentation.html
Thank you so much Thorsten! I will try to bisect (a big problem is most of the time my computer shuts down while trying to compile the kernel because of high temperatures even using `make -j1`). Meanwhile, is there a guide on which information would be useful to post here?
On Debian I was asked to post the boot logs with journalctl, which can be viewed here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1058890 The sleep and wake up works normally for every component but for unencrypting the disk (hardware OPAL SED). Debian Stable 12 x86_64 Old notebook Intel Core™ i7-2720QM and Nvidia GeForce® GTX 560M, and I'm using a self encrypting SSD configured with sedutil. I boot with the parameter libata.allow_tpm=1 and I have a systemd service to store the password in the kernel allowing it to decrypt the disk on waking up. Problem: since linux kernel 6.1.64 (which correspond to Debian linux-image-6.1.0-14-amd64 through 15, 16, 17 and 18) the system is unable to fully wake up from suspend. Most of the time it wakes up to a black screen and CAPS LOCK led doesn't change when pressing the CAPS LOCK button. Sometimes the monitor turns on and I can login in a tty but no command ever works. Not even `reboot` `shutdown` etc. Regardless if the monitor turns on, I can shutdown with Alt + SysRq + B. What I've tested: 1) Adding: nomodeset, init_on_alloc=0, intel_iommu=off to boot the kernel. No change at all. 2) I tested suspend by writing to /sys/power/pm_test the following successively "freezer", "devices", "platform", "processors", "core". Every single test was successful, ie, the system wakes up on its own after 5 seconds. 3) `echo mem > /sys/power/state"` instead of `systemctl suspend`. No change. 4) Changing /etc/systemd/sleep.conf to have "SuspendState=standby" or "SuspendState=freeze" seems to make things worse because the notebook seems unable to enter sleep mode, while "SuspendState=mem" causes the same behavior as reported here. 5) I recompiled sedutil after booting linux-image-6.1.0-18-amd64 (git clone --branch s3-sleep-support https://github.com/badicsalex/sedutil.git). It made no difference.
(In reply to desgua from comment #3) > Thank you so much Thorsten! I will try to bisect (a big problem is most of > the time my computer shuts down while trying to compile the kernel because > of high temperatures even using `make -j1`). Sorry, but with such a big problem that system might be unsuitable for reporting bugs, because a problem like this can lead to all sort of trouble. > Meanwhile, is there a guide on > which information would be useful to post here? https://docs.kernel.org/admin-guide/reporting-issues.html has a few, but in the end it depends a lot on the problem and the subsystem which information would be useful. Note, that document will tell you that this is most likely the wrong place for the report, as the right developers might not even see it here.
Thank you so much Thorsten! <3
Created attachment 305971 [details] bisect the bad commit
I managed to bisect and find the bad commit. I attached a file above, and I will paste the bisect log below: git bisect start # status: waiting for both good and bad commits # good: [2dde18cd1d8fac735875f2e4987f11817cc0bc2c] Linux 6.5 git bisect good 2dde18cd1d8fac735875f2e4987f11817cc0bc2c # status: waiting for bad commit, 1 good commit known # bad: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6 git bisect bad ffc253263a1375a65fa6c9f62a893e9767fbebfa # skip: [a1c19328a160c80251868dbd80066dce23d07995] Merge tag 'soc-arm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc git bisect skip a1c19328a160c80251868dbd80066dce23d07995 # good: [7c3f204e544dfa376bf1b34ebaa5552304a2b7d9] perf/smmuv3: Remove build dependency on ACPI git bisect good 7c3f204e544dfa376bf1b34ebaa5552304a2b7d9 # good: [01a7eb3e20994701700631ec30462087c4ecf142] mm: fix clean_record_shared_mapping_range kernel-doc git bisect good 01a7eb3e20994701700631ec30462087c4ecf142 # good: [0e72db77672ff4758a31fb5259c754a7bb229751] Merge tag 'soc-dt-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc git bisect good 0e72db77672ff4758a31fb5259c754a7bb229751 # good: [0e72db77672ff4758a31fb5259c754a7bb229751] Merge tag 'soc-dt-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc git bisect good 0e72db77672ff4758a31fb5259c754a7bb229751 # good: [0e72db77672ff4758a31fb5259c754a7bb229751] Merge tag 'soc-dt-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc git bisect good 0e72db77672ff4758a31fb5259c754a7bb229751 # good: [5eea5820c7340d39e56e169e1b87199391105f6b] Merge tag 'mm-stable-2023-09-04-14-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm git bisect good 5eea5820c7340d39e56e169e1b87199391105f6b # good: [b300c0fdf0045ede109a349aa9c79f81bfae086a] Merge tag 'hwmon-for-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging git bisect good b300c0fdf0045ede109a349aa9c79f81bfae086a # bad: [7e20d344b53532adf60d77cb41873ebdb4f80cf4] Merge tag 'x86-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad 7e20d344b53532adf60d77cb41873ebdb4f80cf4 # good: [94b7ed384fa9d397ff0aabff76a8de2f7e107144] Merge tag 'for-v6.6-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply git bisect good 94b7ed384fa9d397ff0aabff76a8de2f7e107144 # bad: [cb84fb87f325ecd46be586b62623db5b2c0a792e] Merge tag 'integrity-v6.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity git bisect bad cb84fb87f325ecd46be586b62623db5b2c0a792e # bad: [e402b08634b398e9feb94902c7adcf05bb8ba47d] Merge tag 'soc-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc git bisect bad e402b08634b398e9feb94902c7adcf05bb8ba47d # bad: [ae213639983a5406849d62d33257dfc076bc48a7] Merge tag 'nfsd-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux git bisect bad ae213639983a5406849d62d33257dfc076bc48a7 # good: [eafdc5071351314702175a3cd083cf6f7eef6488] Merge tag 'block-6.6-2023-09-28' of git://git.kernel.dk/linux git bisect good eafdc5071351314702175a3cd083cf6f7eef6488 # bad: [95289e49f0a05f729a9ff86243c9aff4f34d4041] Merge tag 'ata-6.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata git bisect bad 95289e49f0a05f729a9ff86243c9aff4f34d4041 # bad: [3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567] scsi: sd: Differentiate system and runtime start/stop management git bisect bad 3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567 # good: [3b8e0af4a7a331d1510e963b8fd77e2fca0a77f1] ata: libata-core: Fix ata_port_request_pm() locking git bisect good 3b8e0af4a7a331d1510e963b8fd77e2fca0a77f1 # good: [84d76529c650f887f1e18caee72d6f0589e1baf9] ata: libata-core: Fix port and device removal git bisect good 84d76529c650f887f1e18caee72d6f0589e1baf9 # good: [fb99ef17865035a6657786d4b2af11a27ba23f9b] ata: libata-scsi: link ata port and scsi device git bisect good fb99ef17865035a6657786d4b2af11a27ba23f9b # first bad commit: [3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567] scsi: sd: Differentiate system and runtime start/stop management
thx, great. There is one small but kinda important details missing: did you test if a recent 6.8-rc kernel is still affected? I assume you did confirm this, but would be good to be sure.
I just tested 6.8-rc7 from 2024-03-03 and unfortunately it still has this regression.
Thx, forwarded by mail: https://lore.kernel.org/regressions/1b6130bc-69c5-4683-86d1-5ff631da3f80@leemhuis.info/
> The sleep and wake up works normally for every component but for > unencrypting the disk (hardware OPAL SED). I think I understand what is going on here: commit 3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567 changed sd_resume to do nothing and delegate the disk resume to libata, as it should, because we cannot issue any command, even START STOP UNIT, unless the ata port and the device are first resumed. This is the issue that is likely causing what you described (black screen on resume etc). However, the change of commit 3cc2ffe5c16dc65dfac354bc5b5bc98d3b397567 also causes opal_unlock_from_suspend(sdkp->opal_dev) to *NOT* be called, thus leaving the drive locked after resume, so unusable. So I think your bisect is correct and we have a bug. However, this is not trivial to fix because as mentioned above, we must first wait for the ata port and device to be resumed before attempting to access the drive. So I will need some brainstorming to come up with a fix. Give me a couple of days please (I do have SED OPAL drive so I should be able to reproduce this issue).
Thank you so much Damien! <3 If there is anything I can do to help, count on me.
(In reply to desgua from comment #13) > Thank you so much Damien! <3 If there is anything I can do to help, count on > me. Can you try this patch: diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index bdd0acf7fa3c..5e326facc01c 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1723,6 +1723,17 @@ static void sd_rescan(struct device *dev) { struct scsi_disk *sdkp = dev_get_drvdata(dev); + /* + * ATA drives with TCG OPAL, are not unlocked by sd_resume() + * as the device link may still be down when that function is called. + * Since libata calls scsi_rescan_device() once the device link is + * re-established, unlock the device here. + */ + if (sdkp->need_opal_unlock_from_suspend) { + opal_unlock_from_suspend(sdkp->opal_dev); + sdkp->need_opal_unlock_from_suspend = 0; + } + sd_revalidate_disk(sdkp->disk); } @@ -3954,6 +3965,9 @@ static int sd_resume(struct device *dev, bool runtime) return 0; if (!sd_do_start_stop(sdkp->device, runtime)) { + /* Delay unlocking the device (see sd_rescan()) */ + if (sdkp->opal_dev) + sdkp->need_opal_unlock_from_suspend = 1; sdkp->suspended = false; return 0; } diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 409dda5350d1..1db872288a9d 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -151,6 +151,7 @@ struct scsi_disk { unsigned urswrz : 1; unsigned security : 1; unsigned ignore_medium_access_errors : 1; + unsigned need_opal_unlock_from_suspend : 1; }; #define to_scsi_disk(obj) container_of(obj, struct scsi_disk, disk_dev)
YES! I patched v6.8 and it fix the problem. I was able to suspend to S3 and resume normally.
(In reply to desgua from comment #15) > YES! I patched v6.8 and it fix the problem. I was able to suspend to S3 and > resume normally. Excellent. So this confirms the root cause. However, I am not super happy with the fix as it is. I am preparing a different one with the same effect. I will ask you to test this new patch to confirm it works as well. I will send that Monday. Thanks for the testing !
Thank you so much Damien!
Maybe this can help: NVMe is reported to be working fine (https://github.com/Drive-Trust-Alliance/sedutil/issues/90), although I don't know if NVMe driver differentiate system and runtime start/stop management.
(In reply to desgua from comment #18) > Maybe this can help: NVMe is reported to be working fine > (https://github.com/Drive-Trust-Alliance/sedutil/issues/90), although I > don't know if NVMe driver differentiate system and runtime start/stop > management. Thanks, but the problem is well understood and looking at the nvme driver will be of no help here. The issue comes from the mess that ATA device suspend/resume is, which is not really due to bad programming but rather to the peculiarities of AHCI/SATA device handling. The patch I sent above confirms the root cause and what needs to be done. I am preparing a better patch as what I sent is really a hack...
Created attachment 306002 [details] Candidate fix patch v1
(In reply to desgua from comment #17) > Thank you so much Damien! Can you please try the candidate fix patch v1 I attached to this case ? Please make sure to apply it after removing the previous changes you tried. Thanks !
I ran: git reset --hard origin/master git pull origin master git checkout v6.8 And manually checked drivers/scsi/sd.c and sd.h before applying fix patch v1. And YES patch v1 works fine and the computer is able to resume from suspend to S3. Thank you so much Damien!