Every time I resume from S3 (suspend to RAM, `systemctl suspend`, wake by key press), one particular SATA hard disk disappears. It is always the same disk; other disks are not affected. I swapped the SATA port, the power cable and the SATA data cable, to no avail.

On resume, the affected disk (sd 2:0:0:0 / sdb here) produces the following error messages (a few lines of context included):

```
[174232.762381] ata5: found unknown device (class 0)
[174232.762411] ata6: found unknown device (class 0)
[174233.228546] ata3: SATA link down (SStatus 0 SControl 300)
[174237.422376] ata6: softreset failed (device not ready)
[174237.422391] ata5: softreset failed (device not ready)
[174238.290401] ata4: found unknown device (class 0)
[174238.450381] ata4: softreset failed (device not ready)
[174238.450414] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[174238.454123] ata1.00: configured for UDMA/133
[174238.608510] ata3: SATA link down (SStatus 0 SControl 300)
[174238.608521] ata3.00: disable device
[174238.608540] sd 2:0:0:0: rejecting I/O to offline device
[174238.608547] ata3.00: detaching (SCSI 2:0:0:0)
[174238.608554] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[174238.608558] sd 2:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x80 returns -5
[174238.608567] sd 2:0:0:0: PM: failed to resume async: error -5
[174238.630402] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
[174238.630423] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[174238.630424] sd 2:0:0:0: [sdb] Stopping disk
[174238.630428] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[174242.774382] ata6: found unknown device (class 0)
[174242.774417] ata5: found unknown device (class 0)
[174243.174399] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[174243.266633] ata6.00: configured for UDMA/133
[174243.294401] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[174243.387415] ata5.00: configured for UDMA/133
[174243.802389] ata4: found unknown device (class 0)
[174245.942401] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[174246.128870] ata4.00: configured for UDMA/133
```

The disk does NOT reappear after `echo - - - > /sys/class/scsi_host/host3/scan` (tried all `host[0-9]`); `/dev/sdb` stays gone.

I am on openSUSE Tumbleweed with frequent updates. The disk worked fine for months until very recently. It passes a SMART long test, and after a complete shutdown and reboot I backed up 8 TB from it without problems.

Hardware: AMD X570 chipset, AMD Ryzen 5 PRO 4650G CPU, TOSHIBA MG08ACA16TE disk.

Very likely OK: 5.16.15-1-default
Definitely broken: 5.18.1-1-default

I also tested the same setup with the disk connected via USB 3. In that case the disk "survives" suspend and resume.
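For reference, the rescan attempt was essentially the following (a sketch only, assuming the stock sysfs layout; host numbering is machine specific):

```
# Re-scan every SCSI/AHCI host, then check whether sdb came back.
for host in /sys/class/scsi_host/host*; do
    echo "- - -" > "$host/scan"
done
ls /sys/block/ | grep sdb || echo "sdb still gone"
```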
Perhaps related: https://www.linuxquestions.org/questions/linux-hardware-18/sata-disk-problems-up-after-resume-from-memory-4175532754/ (only first half of thread, from 2015, not documented as fixed)
Perhaps related: https://bugzilla.kernel.org/show_bug.cgi?id=76681
I can confirm that this issue still exists in Linux 5.18.2-1-default.
Second hard disk tested: TOSHIBA MG09ACA1, same problem.

What distinguishes these affected disks from the 5 other, non-affected disks?

* both are larger than the others (nominal 16 and 18 TB vs. 10 TB)
* both are from Toshiba (but so is one other, though that one is old)

Hope this helps track down the issue.
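In case it helps with the comparison, here is a rough way to diff the power-management-related identify data of an affected and an unaffected disk (a sketch; the device names are placeholders and smartmontools/hdparm must be installed):

```
# Compare identify data of an affected vs. an unaffected disk.
for dev in /dev/sdb /dev/sdc; do      # placeholders: affected, unaffected
    echo "=== $dev ==="
    smartctl -i "$dev" | grep -Ei 'model|capacity|sata version'
    hdparm -I "$dev" | grep -Ei 'standby|power management|spin-?up'
done
```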
I can confirm that this issue still exists in Linux 5.18.6-1-default. Since this apparently is a regression I hope it is easy to find and fix.
As you have a working and failing config, can you please bisect this?
OK, I will try installing an old kernel, though that may be difficult (rolling distro). This is a production machine so it may take a few days.
Tested 5.17.9-1-default (the oldest one available in the openSUSE Tumbleweed history): the problem already exists in this kernel version.

The disks that failed are a Toshiba MG09ACA18TE and an MG08ACA16TE. A Toshiba DT01ACA300, for example, does not exhibit the problem. The affected ones are the largest disks in the system.

Meanwhile, is there an X570 chipset configuration option in the BIOS that I can try out as a workaround?
> Tested 5.17.9-1-default (the oldest one available in the openSUSE Tumbleweed
> history): the problem already exists in this kernel version.

So this isn't a regression, you're saying? Ideally what we need to see is a bisect down to an individual commit (which would mean compiling from source, see https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html).

> Meanwhile, is there an X570 chipset configuration option in the BIOS that I
> can try out as a workaround?

You'd have to check with your hardware manufacturer on what is exposed in their BIOS related to SATA.
I believe this is a regression. I have narrowed it down now.

Definitely affected: 5.17.1-1-default through 5.18.6-1-default
Very likely OK: 5.16.15-1-default

So while this is not a perfect bisection, I hope the changes in the related drivers (perhaps for the X570's SATA controller?) are few between 5.16.15 and 5.17.1.

Not blaming you (!), but had I been given the request for bisection testing 3.5 weeks ago when I first posted this, perhaps I could have accessed enough older kernels to narrow it down much more.

I extracted the information from past logs (tcsh):

```
foreach n ( `seq 1 14` )
foreach? journalctl -k -b -$n | egrep '(scsi_bus_resume|@buildhost)'
foreach? echo was $n
foreach? echo =============
foreach? end
```

5.16.15-1-default boots did not show the resume problem; I did find traces of suspend (S3) activity but no error. 5.17.1-1-default was the first one that did show errors.
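The same loop in bash, for anyone repeating this (a sketch; assumes the journal still holds at least 14 prior boots):

```
for n in $(seq 1 14); do
    journalctl -k -b "-$n" | grep -E '(scsi_bus_resume|@buildhost)'
    echo "was $n"
    echo "============="
done
```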
> 5.16.15-1-default boots did not show the resume problem; I did find traces
> of suspend (S3) activity but no error.
> 5.17.1-1-default was the first one that did show errors.

How about checking out the kernel sources and doing two builds, with the tags at 5.16.0 and 5.17.0? If you can reproduce the failure with 5.17.0 but not with 5.16.0, you should be able to do a proper bisect.
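A minimal sketch of those two test builds, assuming the running kernel's config is exposed via /proc/config.gz (otherwise copy /boot/config-$(uname -r)); note that the mainline tags are spelled v5.16 and v5.17:

```
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
for tag in v5.16 v5.17; do
    git checkout "$tag"
    zcat /proc/config.gz > .config
    make olddefconfig
    make -j"$(nproc)" binrpm-pkg    # openSUSE is RPM-based
done
```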
Thanks for the suggestion. However, I am afraid I cannot do that; I cannot deploy a custom kernel on this production machine.

That said, it is not so bad, because the regression occurred between 5.16.15 and 5.17.1! The diff should therefore be small, especially when focused only on the code files relevant here. Would that be some sort of SATA driver component for the X570 chipset?

I wonder whether it is the disk size or some other feature of the affected disks that hinders wakeup/accessibility of these drives. The drives do spin up during wakeup from S3 but stay lost to the driver.

If you point me to the relevant source files I can try to do a static analysis, though the original coder would be much more competent to do that.
> That said, it is not so bad, because the regression occurred between 5.16.15
> and 5.17.1!

Unfortunately that is two different stable trees, so it can be a collection of stable commits that caused it, or it could be stuff from the 5.17 merge window. The 5.16.0 to 5.17.0 jump will have thousands of commits; the 5.16.14 to 5.16.15 one will be a lot smaller (probably a hundred or so).

> If you point me to the relevant source files I can try to do a static
> analysis, though the original coder would be much more competent to do that.

Assuming it was in drivers/ata, it's not exactly a small diff:

```
$ git diff v5.16 v5.17 drivers/ata | wc -l
6572
```

Assuming it's a problem in drivers/ata, here is how many commits there are to wade through:

```
$ git log --oneline v5.16..v5.17 drivers/ata | wc -l
113
```

Sometimes these types of problems are outside the subsystem, though. This could easily be something like an ACPICA change that caused it. As a recent example, a very innocuous-looking change to return -ENODEV in what seemed to be an error path of a GPIO driver completely broke resume from s0i3 for all AMD systems. A bisect really is your best bet.
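For reference, a bisect session between those two mainline tags looks roughly like this (a sketch; each step requires building, installing, and testing suspend/resume):

```
git bisect start
git bisect bad v5.17
git bisect good v5.16
# build, install and boot the commit git checks out, test resume, then:
git bisect good      # or: git bisect bad
# repeat until git prints "first bad commit", then:
git bisect reset
```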
BTW - is this your ATA controller? I didn't see a log added so I'm not 100% sure.

> PCI_VDEVICE(AMD, 0x7801)

If so, this might have been caused by a17ab7aba5df4135ef77d7f6d7105e1ea414936f.
Please attach the output of `lspci -nn`, and the Linux logs for the last working and first non-working situation.
Created attachment 301383 [details] lspci -nn
(In reply to Paul Menzel from comment #15)
> Please attach the output of `lspci -nn`, and the Linux logs for the last
> working and first non-working situation.

Do you need specific lines? May I grep for something instead? The files are very large and likely contain personal data:

```
# wc jou*
  9096 106183  742028 journalctl-k-b-10-5.17.1-1-default-hdb_lost
  1792  21641  162138 journalctl-k-b-11-5.16.15-1-default-hdb_not_lost
  1571  18787  139323 journalctl-k-b-12-5.16.15-1-default-hdb_not_lost
  1676  19921  146721 journalctl-k-b-13-5.16.15-1-default-hdb_not_lost
 14135 166532 1190210 total
```

It may take a long time to sift through these files and remove personal data.
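A possible shortcut (a sketch; the pattern is only a guess at what is relevant here, and libata's identify lines with serial numbers would still need a manual check):

```
# Keep only ata/scsi/suspend-resume related kernel lines from one boot.
journalctl -k -b -10 | grep -Ei 'ata[0-9]|scsi|sd[a-z]|sd [0-9]|PM:|suspend|resume' \
    > journalctl-k-b-10-filtered.txt
```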
Your SATA controller in the lspci output is 0x7901, so this is not caused by a17ab7aba5df4135ef77d7f6d7105e1ea414936f. 1527f69204fe35f341cb599f1cb01bd02daf4374 would be a suspect, but that landed in 5.16, and you said 5.16.y is OK.
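For completeness, the controller ID can be double-checked with a filtered lspci query (a one-liner sketch; 1022:7901 is the AMD AHCI device ID mentioned above):

```
lspci -nn -d 1022:7901
```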
Likely related to this problem: https://bugzilla.kernel.org/show_bug.cgi?id=215880
(In reply to Nils from comment #17)
> Do you need specific lines? May I grep for something instead? The files are
> very large and likely contain personal data:
> [...]
> It may take a long time to sift through these files and remove personal data.

Hmm, `journalctl -k` should only contain Linux kernel messages with no personal information. It should contain the same messages as the output of `dmesg`.
Damien, thanks for the insight! Nils, can you try the patches at the end of that bug report?
> Nils, can you try the patches at the end of that bug report?

I am afraid this is a production machine and I cannot install a custom kernel. Can I detect if I am affected by that particular issue in a running system?
(In reply to Paul Menzel from comment #20)
> Hmm, `journalctl -k` should only contain Linux kernel messages with no
> personal information. It should contain the same messages as the output of
> `dmesg`.

I am editing the files now but am finding many serial numbers, names, etc. After 20 minutes I am at 17%. This may take a long time; I will have to continue on another day.
Created attachment 301405 [details]
journalctl -k -b 13

Linux 5.16.15-1-default - hdb not lost during resume from S3.
Created attachment 301406 [details]
journalctl -k -b 10

Linux 5.17.1-1-default - hdb lost during resume from S3.
Would it be possible to test Martin Petersen's for-next branch, which includes a fix for "Resume process hangs for 5-6 seconds starting sometime in 5.16" ("scsi: sd: Rework asynchronous resume support")? See also https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/log/?h=for-next
(In reply to Bart Van Assche from comment #26)
> Would it be possible to test Martin Petersen's for-next branch, which
> includes a fix for "Resume process hangs for 5-6 seconds starting sometime
> in 5.16" ("scsi: sd: Rework asynchronous resume support")? See also
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/log/?h=for-next

I am afraid this is a production machine and I cannot install a custom kernel. Can I detect if I am affected by that particular issue in a running system?
Now that I have had the time to read all the messages above, the following information looks very interesting to me:

Linux 5.16.15-1-default - hdb not lost during resume from S3.
Linux 5.17.1-1-default - hdb lost during resume from S3.

The following changes from me have been included in the sd driver in each of these kernel versions:

```
$ PAGER= git log --author="Bart Van Assche" --reverse --format=oneline v5.15..v5.16.15 drivers/scsi/sd.c
a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85 scsi: core: pm: Rely on the device driver core for async power management
1c9575326a4a09bdb82c104ed1ab4e932c7a1f49 scsi: sd: Rename sd_resume() into sd_resume_system()
9131bff6a9f1d4c6ab2d4673974a9659d8d24dc7 scsi: core: pm: Only runtime resume if necessary

$ PAGER= git log --author="Bart Van Assche" --reverse --format=oneline v5.15..v5.17.1 drivers/scsi/sd.c
a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85 scsi: core: pm: Rely on the device driver core for async power management
1c9575326a4a09bdb82c104ed1ab4e932c7a1f49 scsi: sd: Rename sd_resume() into sd_resume_system()
9131bff6a9f1d4c6ab2d4673974a9659d8d24dc7 scsi: core: pm: Only runtime resume if necessary
db33028647a3eca9e3e6fccf170d75f3b56a466c scsi: Remove superfluous #include <linux/async.h> directives
```

In other words, the patch "scsi: core: pm: Rely on the device driver core for async power management" has been included in both kernel versions, and the fix for that patch ("scsi: sd: Rework asynchronous resume support") is included in neither v5.16.15 nor v5.17.1. My conclusion is that it is *unlikely* that https://bugzilla.kernel.org/show_bug.cgi?id=215880 is related to this issue.

Damien, I found the following in comment #1:

```
[174233.228546] ata3: SATA link down (SStatus 0 SControl 300)
[174238.608510] ata3: SATA link down (SStatus 0 SControl 300)
[174238.608521] ata3.00: disable device
[174238.608540] sd 2:0:0:0: rejecting I/O to offline device
[174238.608547] ata3.00: detaching (SCSI 2:0:0:0)
[174238.608554] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[174238.608558] sd 2:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x80 returns -5
[174238.608567] sd 2:0:0:0: PM: failed to resume async: error -5
[174238.630402] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
[174238.630423] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[174238.630424] sd 2:0:0:0: [sdb] Stopping disk
[174238.630428] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
```

Do you perhaps know what can cause a START STOP UNIT command to be sent to a SATA disk before the SATA link is up?
(In reply to Bart Van Assche from comment #28)
> Do you perhaps know what can cause a START STOP UNIT command to be sent to a
> SATA disk before the SATA link is up?

Resume issues the START STOP UNIT command, which triggers the link-up/resume code in ATA; that failed here, which in turn made the START STOP UNIT command fail. The command is not sent "after" the fact that the link is down. The problem here is that the device PHY/link are not ready.

Bart's patches addressed a similar problem found on another system. It would be really great if you could try them to see if they fix your issue.
I would love to be able to test the kernel! However, I cannot install code or software from a source other than the given one (signed by openSUSE and everything).

I hope the patch for 215880 will be merged soon. Since we are on a rolling release, I will likely be able to test it soon after. I will let you know.
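One way to notice when the fix reaches the distribution kernel (a sketch, assuming openSUSE's kernel-default package and that the patch keeps the subject line quoted above):

```
rpm -q --changelog kernel-default | grep -i 'asynchronous resume'
```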
I can confirm that this issue still exists in Linux 5.19.0.
Please share the output of the following command:

```
find /sys -name proc_name | xargs grep -aH .
```
(In reply to Bart Van Assche from comment #32)
> Please share the output of the following command:
>
> find /sys -name proc_name | xargs grep -aH .

```
# find /sys -name proc_name | xargs grep -aH .
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:09.0/0000:0a:00.0/ata1/host0/scsi_host/host0/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:09.0/0000:0a:00.0/ata4/host3/scsi_host/host3/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:09.0/0000:0a:00.0/ata2/host1/scsi_host/host1/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:09.0/0000:0a:00.0/ata3/host2/scsi_host/host2/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:0a.0/0000:0b:00.0/ata6/host5/scsi_host/host5/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:0a.0/0000:0b:00.0/ata5/host4/scsi_host/host4/proc_name:ahci
/sys/devices/pci0000:00/0000:00:08.2/0000:0f:00.1/ata8/host7/scsi_host/host7/proc_name:ahci
/sys/devices/pci0000:00/0000:00:08.2/0000:0f:00.0/ata7/host6/scsi_host/host6/proc_name:ahci
```

This is after an S3 suspend, which means sdb2 is invisible.
PS: sdb I mean. The second hard disk.
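To make the mapping between block devices and ATA/SCSI hosts unambiguous, something like this can be attached as well (a one-liner sketch):

```
lsblk -d -o NAME,HCTL,TRAN,MODEL,SIZE
```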
I can confirm that this issue still exists in Linux 5.19.8.
Sorry, wrong status - I was looking for CONFIRMED, but VERIFIED is not CONFIRMED. :-(
Could you try this patch?

https://lore.kernel.org/linux-ide/20220921155753.231770-1-niklas.cassel@wdc.com/

You did mention that the problem is on a production machine; I hope you can test this on a similar machine that is not in production.
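If a test machine becomes available, a rough way to fetch, apply and build the proposed patch (a sketch; assumes b4 is installed and that the patch applies near the v5.19 tag, which may need adjusting to the ATA tree's actual base):

```
git checkout v5.19
b4 am 20220921155753.231770-1-niklas.cassel@wdc.com   # downloads the series from lore as an .mbx file
git am ./*.mbx
zcat /proc/config.gz > .config && make olddefconfig
make -j"$(nproc)"
```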